In this module, we will be using the dplyr
package to
perform data transformation in R. We will use a new data set on flights
departing New York City in 2013. You must install the package
nycflights13
to make it available.
install.packages("nycflights13")
library(tidyverse)
library(nycflights13)
nycflights13
data setTo explore the basic data manipulation verbs of dplyr, we’ll use
nycflights13::flights
. This data frame contains all 336,776
flights that departed from New York City in 2013. The data comes from
the US Bureau
of Transportation Statistics, and is documented in
?flights
.
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Please check the meaning of each variable from the help document.
To summarize the data types in R so far:
int
stands for integers.dbl
stands for doubles, or real numbers.chr
stands for character vectors, or strings.dttm
stands for date-times (a date + a time).lgl
stands for logical, vectors that contain only
TRUE
or FALSE
.fct
stands for factors, which R uses to represent
categorical variables with fixed possible values.date
stands for dates.filter()
).arrange()
).select()
).mutate()
).summarise()
).group_by()
filter()
filter()
allows you to subset observations based on
their values. For example, to pick up the flights on January 1st, we may
do the following:
filter(flights, month == 1, day == 1)
## # A tibble: 842 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
<-
:To use that filtered data, we need to assign it to a new name with
the assignment <-
. (Later we will learn to use a pipe to
simplify this process.)
jan1 <- filter(flights, month == 1, day == 1)
jan1
## # A tibble: 842 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
fliter()
fliter(<DATA_NAME>, ...<LOGICAL CONDITIONS>...)
Everything after data_name
should be an expression
returning some logical value. When multiple conditions are given, then
rows that satisfy all conditions (every condition is TRUE
)
will be retained.
fliter()
function supports all logical comparisons to
fliter our data using the following operators:
>
, >=
, <
,
<=
, !=
(not equal to), and ==
(equal to).
=
filter(flights, month = 1)
## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 1`?
==
when dealing with
float:When handling float numbers, it is usually not a good idea to use
==
since there is machine error for floats and even
mathmatically true equations may turn out to be FALSE
.
sqrt(2) ^ 2 == 2
## [1] FALSE
1/49 * 49 == 1
## [1] FALSE
Instead, one should use near()
instead to check
equations about float numbers.
near(sqrt(2) ^ 2, 2)
## [1] TRUE
near(1 / 49 * 49, 1)
## [1] TRUE
Use Boolean operators with filter()
: &
is “and”, |
is “or”, and !
is “not”.
Complete set of boolean operations. x
is the left-hand
circle, y
is the right-hand circle, and the shaded region
show which parts each operator selects.
filter(flights, month == 11 | month == 12)
## # A tibble: 55,403 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 11 1 5 2359 6 352 345 7 B6
## 2 2013 11 1 35 2250 105 123 2356 87 B6
## 3 2013 11 1 455 500 -5 641 651 -10 US
## 4 2013 11 1 539 545 -6 856 827 29 UA
## 5 2013 11 1 542 545 -3 831 855 -24 AA
## 6 2013 11 1 549 600 -11 912 923 -11 UA
## 7 2013 11 1 550 600 -10 705 659 6 US
## 8 2013 11 1 554 600 -6 659 701 -2 US
## 9 2013 11 1 554 600 -6 826 827 -1 DL
## 10 2013 11 1 554 600 -6 749 751 -2 DL
## # … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
filter(flights, month == (11 | 12))
This is because (11 | 12)
is equivalent to evaluate
TRUE|TRUE
(any non zero value is considered
TRUE
in logical expressions) which returns
TRUE
. Then TRUE
is interpreted as
1
to match the numeric variable month
. So this
finds all flights in January instead of November and December!
x %in% y
.%in%
is equivalent to the in
keyword in
Python. It can conveniently replace |
when there are many
options. The following code finds flights that depart in May, July or
November.
nov_dec <- filter(flights, month %in% c(5, 7, 11))
&&
and ||
, butDon’t use them here in filter()
!
between()
helper function for (X > a) & (X
< b)For filtering a value between two values, for example, depature time
between 0:00am and 6:00am, we can use the between
function:
filter(flights, between(dep_time, 0000, 0600))
## # A tibble: 9,344 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 9,334 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
JFK
and land on
ORD
or CVG
(Chicago or Dallas).NA
Missing values are represented as NA
s (“not available”)
in R. NA
by itself is a constant in R, and can be of any
type depending on the type of other available data.
NA
are “contagious” in mathematical operations:
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
FALSE
?NA == NA
## [1] NA
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
## [1] NA
# We don't know!
is.na()
to filter out NA
in your data
setIf you want to determine if a value is missing, use
is.na()
:
is.na(x)
## [1] TRUE
filter()
excludes both FALSE
and
NA
values.filter(flights, is.na(dep_time))
## # A tibble: 8,255 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 NA 1630 NA NA 1815 NA EV
## 2 2013 1 1 NA 1935 NA NA 2240 NA AA
## 3 2013 1 1 NA 1500 NA NA 1825 NA AA
## 4 2013 1 1 NA 600 NA NA 901 NA B6
## 5 2013 1 2 NA 1540 NA NA 1747 NA EV
## 6 2013 1 2 NA 1620 NA NA 1746 NA EV
## 7 2013 1 2 NA 1355 NA NA 1459 NA EV
## 8 2013 1 2 NA 1420 NA NA 1644 NA EV
## 9 2013 1 2 NA 1321 NA NA 1536 NA EV
## 10 2013 1 2 NA 1545 NA NA 1910 NA AA
## # … with 8,245 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
filter(flights, dep_time > 1900)
## # A tibble: 54,702 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 1904 1905 -1 2139 2227 -48 DL
## 2 2013 1 1 1904 1905 -1 2157 2208 -11 UA
## 3 2013 1 1 1905 1900 5 2311 2301 10 DL
## 4 2013 1 1 1906 1915 -9 2211 2244 -33 DL
## 5 2013 1 1 1909 1912 -3 2239 2237 2 B6
## 6 2013 1 1 1909 1910 -1 2212 2224 -12 DL
## 7 2013 1 1 1910 1909 1 2126 2046 40 EV
## 8 2013 1 1 1910 1910 0 2126 2107 19 9E
## 9 2013 1 1 1910 1855 15 2118 2103 15 US
## 10 2013 1 1 1911 1910 1 2050 2055 -5 MQ
## # … with 54,692 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The first result shows that there are missing values in actual
departure time, and the second result shows that when we do filtering,
NA
values are dropped.
Let’s exercise what we just learned combining data transformation and
data visualization. One routine practice is to remove samples with
NA
values before making the graph.
For example, if we want to create a histogram of
dep_time
, we can directly make the graph and
ggplot
will smartly throw away the NA
values.
A warning message would pop out to hint us with this fact.
ggplot(flights) + geom_histogram(aes(dep_time))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).
But sometimes we hope to know information about those flights with
NA
values. For example, NA
in
dep_time
means that the flight has been canceled (therefore
no departure time). If we hope to know which airline company canceled
the most flights in our data set, we can do the following:
canceled_flight <- filter(flights, is.na(dep_time))
ggplot(canceled_flight) +
geom_bar(aes(carrier)) +
labs(title = "Canceled Flights by Carrier",
x = "Carrier (in abbreviation)",
y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
EV
canceled the most flights in
our data set. After checking airlines
commands we see that
it is ExpressJet Airlines.For EV
, during which month were most flights
canceled?
select()
The second function to study here is the select()
function, which select variables from the original data set and form a
new one.
There are a few ways to use select()
, as shown
below.
# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 × 16
## dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
## <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
## # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
## # ⁵sched_arr_time, ⁶arr_delay
# Select by col number
select(flights, 1:3)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
Similarly to filter()
, if you want to save the data
after selecting, we must either name it or use the pipe (discussed
later).
new_data <- select(flights, year, month, day)
starts_with(), ends_with(), contains()
There are a number of helper functions you can use within
select()
:
* `starts_with("abc")`: matches names that begin with "abc".
* `ends_with("xyz")`: matches names that end with "xyz".
* `contains("ijk")`: matches names that contain "ijk".
* `matches("(.)\\1")`: selects variables that match a regular expression.
* `num_range("x", 1:3)` matches `x1`, `x2` and `x3`.
* See `?select` for more details.
# Select columns of year, month, day and those starting with "dep" or "arr"
dep_data = select(flights, year, month, day, starts_with(c("dep", "arr")))
glimpse(dep_data)
## Rows: 336,776
## Columns: 7
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, …
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31…
This may not look useful for a data set with only dozens of columns. But when you have hundreds of columns or more, this can be very helpful.
rename()
to rename a columnSometimes we hope to rename a column for convenience. We can do the following:
rename(flights, sch_dep = sched_dep_time) # new_name = old_name
## # A tibble: 336,776 × 19
## year month day dep_time sch_dep dep_delay arr_time sched…¹ arr_d…² carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_arr_time, ²arr_delay
everything()
to reorder the columnsAnother option is to use select()
in conjunction with
the everything()
helper. This is useful if you have a
handful of variables you’d like to move to the start of the data
frame.
For example, if we want to move time_hour
and
air_time
to the first two columns and keep all others. We
can do
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 × 19
## time_hour air_t…¹ year month day dep_t…² sched…³ dep_d…⁴ arr_t…⁵
## <dttm> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600 -5 913
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600 -3 709
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600 -3 838
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, 10 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## # variable names ¹air_time, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵arr_time
arrange()
Next, we will study the arrange()
function which changes
row order by sorting the values in one or more given columns.
For example, the original data set does not completely follow the order of time since month 10, 11, 12 goes right after month 1. To resolve this, we can rearrange the rows by month and day.
flights_md <- arrange(flights, month, day)
Similar to all previous functions, the first argument is the name of data set, followed by column names.
desc()
By default, the arrange()
function sorting out values
from the smallest to the largest. To reverse it, we can simply put an
desc
(representing “descending”) function on the column
name.
for example, we hope to see flights with the worst delay in departure, we can do the following:
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL
## 7 2013 3 17 2321 810 911 135 1020 915 DL
## 8 2013 6 27 959 1900 899 1236 2226 850 DL
## 9 2013 7 22 2257 759 898 121 1026 895 DL
## 10 2013 12 5 756 1700 896 1058 2020 878 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
We see that the worst one delayed by 1301 minutes, which is over 21 hours!
arrange()
function can be very useful when we hope to
quickly check the minimum and maximum values of any column.
(df <- tibble(x = c(5, 2, NA)))
## # A tibble: 3 × 1
## x
## <dbl>
## 1 5
## 2 2
## 3 NA
arrange(df, x)
## # A tibble: 3 × 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 × 1
## x
## <dbl>
## 1 5
## 2 2
## 3 NA
select()
and
arrange()
When there are too many columns, we may hope to make the arranged
column the first one. For example, if we arrange the data set by
air_time
, then we can do the following:
select(arrange(flights, air_time), flight, month, day, year, air_time)
## # A tibble: 336,776 × 5
## flight month day year air_time
## <int> <int> <int> <int> <dbl>
## 1 4368 1 16 2013 20
## 2 4631 4 13 2013 20
## 3 4276 12 6 2013 21
## 4 4619 2 3 2013 21
## 5 4368 2 5 2013 21
## 6 4619 2 12 2013 21
## 7 2132 3 2 2013 21
## 8 3650 3 8 2013 21
## 9 4118 3 18 2013 21
## 10 4276 3 19 2013 21
## # … with 336,766 more rows
Let’s do another analysis example. Let’s try to analyze what is the prominent factor that causes very long delays. First, let’s separate our data set into two parts - those with long delay (more than 2 hours) and all other flights.
long_delay <- filter(flights, dep_delay >= 120 | arr_delay >= 120)
others <- filter(flights, dep_delay < 120 & arr_delay < 120)
So we have 11606 observations in long_delay
. That’s
about 3.4% of the whole data set.
nrow(long_delay)/nrow(flights)
## [1] 0.03446208
Let’s see whether carrier is a big factor here:
ggplot(long_delay) +
geom_bar(aes(carrier, y = after_stat(count/sum(count)))) +
labs(title = "Long Delay Flights", x = "Carrier", y = "Relative Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
ggplot(others) +
geom_bar(aes(carrier, y = after_stat(count/sum(count)))) +
labs(title = "Long Delay Flights", x = "Carrier", y = "Relative Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
We can see that the distributions of carriers are quite different between the two graphs. There are some carriers with significantly higher proportion of flights with long delays - AA, EV, UA, US etc.