Part 1: Explore

Chapter 1: Data Visualization with ggplot2

Learning how aesthetic mappings work in ggplot2

mpg %>%
  ggplot(aes(displ,hwy, color = fl))+
  geom_point()+
  facet_grid(drv~cyl)

4. Take the first faceted plot in this section:

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
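
The plot above repeats the same x and y mapping in both geoms. As a minimal sketch (the object name p is only illustrative), the shared mapping can be declared once in ggplot(), so that every layer inherits it, while a mapping inside a geom applies to that layer only:

```r
library(ggplot2)

# Global mapping: x and y are declared once in ggplot() and are
# inherited by every layer added afterwards.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # local mapping, this layer only
  geom_smooth(se = FALSE)                     # uses only the global x and y

p
```

Declaring the mapping globally means a change of variable has to be made in one place only.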

mpg %>%
  ggplot(aes(displ,hwy))+
  geom_point(aes(color = class))+
  geom_smooth(data = filter(mpg,class == "compact"), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

6. Re-create the R code necessary to generate the following graphs.

mpg %>%
  ggplot(aes(displ,hwy))+
  geom_point()+
  geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

mpg %>%
  ggplot(aes(displ,hwy))+
  geom_point(aes(color = drv))+
  geom_smooth(aes(linetype = drv), se = FALSE)+
  theme_light()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Analyzing the diamonds data to learn how geom_bar() works

diamonds %>%
  ggplot(aes(cut))+
  geom_bar()

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity:

diamonds %>%
  ggplot()+
  geom_bar(aes(cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”:

• position = “identity” will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA:

diamonds %>%
  ggplot()+
  geom_bar(aes(cut, fill = clarity),alpha = 1/5, position = "identity")

ggplot(
  data = diamonds,
  mapping = aes(x = cut, color = clarity)
) +
  geom_bar(fill = NA, position = "identity")
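
The two remaining position adjustments named above, “fill” and “dodge”, can be sketched in the same way. The object names p_fill and p_dodge are only illustrative:

```r
library(ggplot2)

# position = "fill" stacks the bars and rescales each stack to height 1,
# which makes proportions comparable across cuts.
p_fill <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")

# position = "dodge" places the bars for each clarity side by side,
# which makes individual counts easier to compare.
p_dodge <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")

p_fill
p_dodge
```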

Learning how to use geom_jitter()

mpg %>%
  ggplot(aes(displ,cty))+
  geom_point()+
  geom_jitter()+
  labs(subtitle = "This plot combines geom_point() and geom_jitter() (mpg data)")

mpg %>%
  ggplot(aes(displ,cty))+
  geom_jitter()+
  labs(subtitle = "This plot uses only geom_jitter() (mpg data)")

mpg %>%
  ggplot(aes(displ,cty))+
  geom_point()+
  labs(subtitle = "This plot uses only geom_point() (mpg data)")
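
geom_jitter() is shorthand for geom_point(position = "jitter"). As a sketch, the amount of noise can be controlled explicitly with position_jitter(); the width, height and seed values below are arbitrary choices for illustration, not recommendations:

```r
library(ggplot2)

# width and height set the amount of random noise in data units; the noise
# is added in both directions, so the total spread is twice each value.
# A fixed seed makes the jittered positions reproducible between runs.
p <- ggplot(mpg, aes(displ, cty)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.3, seed = 42))

p
```

Small values keep the points close to their true positions, while larger values separate overlapping points more clearly at the cost of accuracy.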

Exercises

  1. What is wrong with the following plot?
mpg %>%
  ggplot(aes(cty, hwy)) +
  geom_point()

Solution: the plot above suffers from overplotting. Both cty and hwy are rounded to whole numbers, so many observations share the same coordinates and are drawn on top of one another. One fix is to replace geom_point() with geom_jitter(), which adds a small amount of random noise to each point:

mpg %>%
  ggplot(aes(cty, hwy)) +
  geom_jitter()

You can also use geom_count(), which draws one point per location and scales its size by the number of observations at that location:

mpg %>%
  ggplot(aes(cty, hwy)) +
  geom_count()

Mapping color to class inside geom_count() additionally shows which classes of car make up each cluster; the counts are then computed separately for each class:

mpg %>%
  ggplot(aes(cty, hwy)) +
  geom_count(aes(color = class))
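
The extent of the overplotting can also be checked numerically. The sketch below, in which the name overlaps is only illustrative, counts how many observations fall on each (cty, hwy) location:

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

# Number of observations at each (cty, hwy) location, largest first.
overlaps <- mpg %>%
  count(cty, hwy, sort = TRUE)

nrow(mpg)       # number of cars in the data
nrow(overlaps)  # number of distinct point locations a plain scatterplot draws
head(overlaps)
```

If the second number is smaller than the first, some points in the plain scatterplot hide others.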

Learning how to use geom_boxplot()

mpg %>%
  ggplot(aes(x=reorder(class,hwy,FUN = median), hwy)) +
  geom_boxplot()+
  coord_flip()

* In the plot above I used the reorder() function, which is covered in the Exploratory Data Analysis chapter.

Coordinate Systems

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y position act independently to find the location of each point. There are a number of other coordinate systems that are occasionally helpful:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()

* coord_flip() switches the x- and y-axes. This is useful (for example) if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()+
  coord_flip()

How to use coord_polar() for better visualization?

diamonds %>%
  ggplot(aes(cut, fill = cut))+
  geom_bar(show.legend = FALSE, width = 1)+
  coord_polar()+
  labs(y=NULL,x=NULL)

Exercises

  4. What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
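
As a sketch of what the defaults in this plot do: geom_abline() with no arguments draws the line with intercept 0 and slope 1, and coord_fixed() uses a ratio of 1, so the reference line appears at 45 degrees. The version below spells out those defaults; the name share_above is only illustrative:

```r
library(ggplot2)

# Spelling out the defaults: geom_abline() draws y = intercept + slope * x,
# and coord_fixed(ratio = 1) makes one unit on x as long as one unit on y.
p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed(ratio = 1)
p

# Share of cars whose highway mileage exceeds their city mileage,
# i.e. the share of points lying above the reference line.
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```

Points above the line are cars that get more miles per gallon on the highway than in the city.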

Chapter 3: Data Transformation with dplyr

Loading the required packages and data for this chapter (if they are not yet loaded)

library(tidyverse)
library(nycflights13)

Using the filter() function

Finding all the flights that departed on January 1

jan1<-flights %>%
  filter(month==1, day ==1)

Finding all the flights that departed in November or December 2013

# one way is to use the following code
flights %>%
  filter(month %in% c(11,12)) %>%
  View()

# or use the following one:
flights %>%
  filter (month ==11 | month ==12)
## # A tibble: 55,403 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     1        5           2359         6      352            345
##  2  2013    11     1       35           2250       105      123           2356
##  3  2013    11     1      455            500        -5      641            651
##  4  2013    11     1      539            545        -6      856            827
##  5  2013    11     1      542            545        -3      831            855
##  6  2013    11     1      549            600       -11      912            923
##  7  2013    11     1      550            600       -10      705            659
##  8  2013    11     1      554            600        -6      659            701
##  9  2013    11     1      554            600        -6      826            827
## 10  2013    11     1      554            600        -6      749            751
## # ... with 55,393 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
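
The two filters above are meant to select the same rows. A quick way to check this, sketched with illustrative object names, is to store both results and compare them. The last lines show a common mistake that looks similar but behaves differently:

```r
library(dplyr)
library(nycflights13)

nov_dec_in <- flights %>% filter(month %in% c(11, 12))
nov_dec_or <- flights %>% filter(month == 11 | month == 12)

# Both filters keep the same rows in the same order.
identical(nov_dec_in, nov_dec_or)

# A common mistake: `==` binds more tightly than `|`, so this is read as
# (month == 11) | 12. The number 12 counts as TRUE, so every row is kept.
wrong <- flights %>% filter(month == 11 | 12)
nrow(wrong) == nrow(flights)
```

%in% is usually the more readable choice once more than two values are involved.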

Finding all the flights that were delayed by two hours or more on both departure and arrival

flights %>%
  filter(arr_delay >= 120 & dep_delay >= 120)
## # A tibble: 8,482 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      848           1835       853     1001           1950
##  2  2013     1     1      957            733       144     1056            853
##  3  2013     1     1     1114            900       134     1447           1222
##  4  2013     1     1     1815           1325       290     2120           1542
##  5  2013     1     1     1842           1422       260     1958           1535
##  6  2013     1     1     1856           1645       131     2212           2005
##  7  2013     1     1     1934           1725       129     2126           1855
##  8  2013     1     1     1938           1703       155     2109           1823
##  9  2013     1     1     1942           1705       157     2124           1830
## 10  2013     1     1     2006           1630       216     2230           1848
## # ... with 8,472 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Finding the flights that were delayed by less than two hours on both departure and arrival

flights %>%
  filter(!(arr_delay >= 120 | dep_delay >= 120)) %>%
  View()
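
By De Morgan's law, !(x | y) is the same as !x & !y, so the negated filter above can also be written without the negation. A minimal sketch with illustrative object names:

```r
library(dplyr)
library(nycflights13)

# Negated form, as used above.
not_either <- flights %>%
  filter(!(arr_delay >= 120 | dep_delay >= 120))

# Equivalent form: both delays are under two hours.
both_under <- flights %>%
  filter(arr_delay < 120, dep_delay < 120)

identical(not_either, both_under)
```

In both versions filter() drops rows where the condition evaluates to NA, which is why the two results match even though the delay columns contain missing values.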

Exercises

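  1. Find all flights that:
     a. Had an arrival delay of two or more hours
     b. Flew to Houston (IAH or HOU)
     c. Were operated by United, American, or Delta
     d. Departed in summer (July, August, and September)
     e. Arrived more than two hours late, but didn’t leave late
     f. Were delayed by at least an hour, but made up over 30 minutes in flight
# a. Had an arrival delay of two or more hours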
flights %>%
  filter(arr_delay >= 120)
## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # ... with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# b. Flew to Houston (IAH or HOU)
flights %>%
  filter(dest == "HOU" | dest == "IAH")
## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# c. Were operated by United, American, or Delta
flights %>%
  filter(carrier %in% c("AA","UA","DL"))
## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # ... with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# d. Departed in summer (July, August, and September)
flights %>%
  filter(month %in% c(7,8,9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# e. Arrived more than two hours late, but didn’t leave late
flights %>%
  filter(arr_delay>120 & dep_delay <=0)
## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # ... with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# f. Were delayed by at least an hour, but made up over 30 minutes in flight
# "made up over 30 minutes" means the arrival delay is more than 30 minutes
# smaller than the departure delay
flights %>%
  filter(dep_delay >= 60, dep_delay - arr_delay > 30)

between() is another useful dplyr helper that can simplify some of the filters above.

Below is an example of how to filter the flights that departed in summer (July, August, and September):

flights %>%
  filter(between(month,7,9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
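
between(x, left, right) includes both end points, which is why between(month, 7, 9) keeps July and September as well as August. A small sketch on an illustrative vector x:

```r
library(dplyr)

x <- c(6, 7, 8, 9, 10)

# between() is inclusive on both ends.
between(x, 7, 9)
## [1] FALSE  TRUE  TRUE  TRUE FALSE

# It is shorthand for the following comparison.
x >= 7 & x <= 9
## [1] FALSE  TRUE  TRUE  TRUE FALSE
```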
  2. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
flights %>%
  filter(is.na(dep_time))
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # ... with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
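
The output above shows 8,255 rows with a missing dep_time, and in the rows displayed dep_delay and arr_time are missing too, which would be consistent with cancelled flights (an interpretation, since the data do not say so directly). To see which other variables contain missing values, one possible sketch counts the NA entries in every column; the name na_counts is only illustrative:

```r
library(nycflights13)

# Number of missing values in every column of flights.
na_counts <- colSums(is.na(flights))

# Keep only the columns that contain at least one missing value.
na_counts[na_counts > 0]
```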

3. Arrange Rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

# arrange the flights in descending order of month, day, and dep_time
flights %>%
  arrange(-month,-day, -dep_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    31     2356           2359        -3      436            445
##  2  2013    12    31     2355           2359        -4      430            440
##  3  2013    12    31     2332           2245        47       58              3
##  4  2013    12    31     2328           2330        -2      412            409
##  5  2013    12    31     2321           2250        31       46              8
##  6  2013    12    31     2310           2255        15        7           2356
##  7  2013    12    31     2245           2250        -5     2359           2356
##  8  2013    12    31     2235           2245       -10     2351           2355
##  9  2013    12    31     2218           2219        -1      315            304
## 10  2013    12    31     2211           2159        12      100             45
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# arrange the flights by descending order of their dep_time
flights %>%
  arrange(-dep_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    30     2400           2359         1      327            337
##  2  2013    11    27     2400           2359         1      515            445
##  3  2013    12     5     2400           2359         1      427            440
##  4  2013    12     9     2400           2359         1      432            440
##  5  2013    12     9     2400           2250        70       59           2356
##  6  2013    12    13     2400           2359         1      432            440
##  7  2013    12    19     2400           2359         1      434            440
##  8  2013    12    29     2400           1700       420      302           2025
##  9  2013     2     7     2400           2359         1      432            436
## 10  2013     2     7     2400           2359         1      443            444
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# or you can also use desc() to arrange by descending order
flights %>%
  arrange(desc(dep_time))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    30     2400           2359         1      327            337
##  2  2013    11    27     2400           2359         1      515            445
##  3  2013    12     5     2400           2359         1      427            440
##  4  2013    12     9     2400           2359         1      432            440
##  5  2013    12     9     2400           2250        70       59           2356
##  6  2013    12    13     2400           2359         1      432            440
##  7  2013    12    19     2400           2359         1      434            440
##  8  2013    12    29     2400           1700       420      302           2025
##  9  2013     2     7     2400           2359         1      432            436
## 10  2013     2     7     2400           2359         1      443            444
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  • Note that the minus sign in the code above sorts the given variable in descending order.
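
Two points from this section can be sketched on a small made-up tibble (df2 and its values are hypothetical and not taken from flights): additional columns break ties, and desc() also works where a minus sign cannot be used.

```r
library(dplyr)

# Hypothetical data for illustration only.
df2 <- tibble(
  carrier = c("UA", "AA", "UA", "AA"),
  delay   = c(5, 20, 30, 10)
)

# Sort by carrier; rows with the same carrier are then ordered by delay.
by_carrier <- arrange(df2, carrier, delay)
by_carrier

# desc() also works on character columns, where unary minus would fail.
by_carrier_desc <- arrange(df2, desc(carrier), desc(delay))
by_carrier_desc
```

For numeric columns the minus sign and desc() give the same order, so the choice between them is a matter of readability.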

Missing values are always sorted at the end:

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA
arrange(df,desc(x))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

Exercises

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
# arrange the flights in descending order of dep_time, with missing values first
flights %>%
  arrange(desc(is.na(dep_time)),-dep_time) %>%
  View()
# arrange the variable x with missing values first
arrange(df,desc(is.na(x)))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1    NA
## 2     5
## 3     2
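
The reason desc(is.na(x)) moves missing values to the top can be sketched in base R: is.na() returns a logical vector, and TRUE sorts after FALSE, so a descending sort puts the TRUE entries (the missing values) first.

```r
x <- c(5, 2, NA)

is.na(x)
## [1] FALSE FALSE  TRUE

# TRUE is treated as 1 and FALSE as 0, so descending order puts TRUE first.
order(is.na(x), decreasing = TRUE)
## [1] 3 1 2
```

The remaining rows keep their original relative order unless a further sorting column is supplied.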
  2. Sort flights to find the most delayed flights. Find the flights that left earliest.
# the most delayed flights are found by sorting dep_delay in descending order
flights %>%
  arrange(-dep_delay)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# or, equivalently, with desc()
flights %>%
  arrange(desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The most delayed flight was HA 51, JFK to HNL, which was scheduled to leave on January 09, 2013 09:00. Note that the departure time is given as 641, which seems to be less than the scheduled departure time. But the departure was delayed 1,301 minutes, which is 21 hours, 41 minutes. The departure time is the day after the scheduled departure time.
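
The arithmetic in this paragraph can be checked with a short sketch. The time zone is an assumption made only to keep the calculation free of daylight-saving effects:

```r
delay_min <- 1301

# Split the delay into whole hours and remaining minutes.
delay_min %/% 60  # hours
## [1] 21
delay_min %% 60   # minutes
## [1] 41

# Add the delay (in seconds) to the scheduled departure.
# UTC is an assumption used only to keep the arithmetic simple.
sched <- as.POSIXct("2013-01-09 09:00", tz = "UTC")
actual <- format(sched + delay_min * 60, "%Y-%m-%d %H:%M")
actual
## [1] "2013-01-10 06:41"
```

The result matches the recorded dep_time of 641 on the following day.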

The flights that left earliest relative to their schedule are found by sorting dep_delay in ascending order:

flights %>%
  arrange(dep_delay)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Flight B6 97 (JFK to DEN) scheduled to depart on December 07, 2013 at 21:23 departed 43 minutes early.

  3. Sort flights to find the fastest flights. There are two ways to interpret this question:
  1. the flight with the shortest time in the air (air_time)
  2. the flight with the highest average speed, that is, the one that covered its distance in the least time.
# for the first interpretation, which is the simpler one, sort by air_time
flights %>%
  arrange(air_time) %>%
  View()

The shortest flight spent only 20 minutes in the air: EV 4368 from EWR to BDL. However, it covered only 116 miles, so it was not necessarily the fastest in terms of speed. Answering the second interpretation requires mutate(), which has not been covered yet.

# speed is not in the dataset, so we compute it with mutate(); it is measured in miles per minute
flights %>%
  mutate(speed = distance/air_time)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   speed <dbl>
# then we sort by speed in descending order to find the flight that covered its distance in the least time

flights %>%
  mutate(speed = distance/air_time) %>%
  arrange(desc(speed))%>%
  View()

Under the second interpretation, the answer is flight DL 1499 from LGA to ATL, which departed on May 25, 2013 at 17:09. It was the fastest flight, covering 762 miles in only 1 hour and 5 minutes.
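
As a worked check of these figures, the sketch below converts the distance and air time quoted above into miles per minute (the unit used in the mutate() call) and into miles per hour. The variable names are only illustrative:

```r
distance_miles <- 762
air_time_min   <- 65   # 1 hour and 5 minutes

speed_miles_per_min <- distance_miles / air_time_min
speed_mph <- speed_miles_per_min * 60

round(speed_miles_per_min, 2)
## [1] 11.72
round(speed_mph, 1)
## [1] 703.4
```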

4. Select Columns with select()