mpg %>%
ggplot(aes(displ,hwy, color = fl))+
geom_point()+
facet_grid(drv~cyl)
What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
mpg %>%
ggplot(aes(displ,hwy))+
geom_point(aes(color = class))+
geom_smooth(data = filter(mpg,class == "compact"), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
mpg %>%
ggplot(aes(displ,hwy))+
geom_point()+
geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
mpg %>%
ggplot(aes(displ,hwy))+
geom_point(aes(color = drv))+
geom_smooth(aes(linetype = drv), se = FALSE)+
theme_light()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
diamonds %>%
ggplot(aes(cut))+
geom_bar()
Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity:
diamonds %>%
ggplot()+
geom_bar(aes(cut, fill = clarity))
The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill":
• position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA:
diamonds %>%
ggplot()+
geom_bar(aes(cut, fill = clarity),alpha = 1/5, position = "identity")
ggplot(
data = diamonds,
mapping = aes(x = cut, color = clarity)
) +
geom_bar(fill = NA, position = "identity")
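The other two options named above, "fill" and "dodge", can be sketched in the same way. This is a minimal illustration on the same diamonds data; the object names p_fill and p_dodge are only for this example.

```r
library(ggplot2)

# position = "fill" stacks the bars but scales every stack to the same height,
# which makes proportions easier to compare across cuts
p_fill <- ggplot(diamonds) +
  geom_bar(aes(cut, fill = clarity), position = "fill")

# position = "dodge" places the bars for each clarity side by side,
# which makes individual counts easier to compare
p_dodge <- ggplot(diamonds) +
  geom_bar(aes(cut, fill = clarity), position = "dodge")

p_fill
p_dodge
```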
mpg %>%
ggplot(aes(displ,cty))+
geom_point()+
geom_jitter()+
labs(subtitle = "geom_point() and geom_jitter() combined (mpg data)")
mpg %>%
ggplot(aes(displ,cty))+
geom_jitter()+
labs(subtitle = "geom_jitter() only (mpg data)")
mpg %>%
ggplot(aes(displ,cty))+
geom_point()+
labs(subtitle = "geom_point() only (mpg data)")
mpg %>%
ggplot(aes(cty, hwy)) +
geom_point()
### Solution: the plot above suffers from overplotting. Many cars share the same cty and hwy values, so their points are drawn on top of each other and the plot does not show every observation. This can be corrected by replacing geom_point() with geom_jitter(), which adds a small amount of random noise to each point:
mpg %>%
ggplot(aes(cty, hwy)) +
geom_jitter()
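One way to check the overplotting numerically is to count how many cars share each (cty, hwy) position. This is only a sketch, assuming dplyr is loaded alongside ggplot2; the names positions and n_cars are illustrative.

```r
library(dplyr)
library(ggplot2)   # provides the mpg data

# one row per distinct (cty, hwy) position, with the number of cars at that position
positions <- mpg %>%
  count(cty, hwy, name = "n_cars")

nrow(mpg)         # number of observations
nrow(positions)   # number of distinct positions that geom_point() can show

# positions shared by more than one car are the ones that overplot
positions %>%
  filter(n_cars > 1) %>%
  arrange(desc(n_cars))
```

If the second count is smaller than the first, some points in the scatterplot necessarily hide others.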
You can also use geom_count() to address the overplotting. It draws one point per position and sizes it by the number of observations at that position:
mpg %>%
ggplot(aes(cty, hwy)) +
geom_count()
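When color is mapped to class inside geom_count(), the observations are counted separately for each class, so both the size and the color of a point carry information. Note that points from different classes at the same position can still overlap:
mpg %>%
ggplot(aes(cty, hwy)) +
geom_count(aes(color = class))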
Learning how to use geom_boxplot()
mpg %>%
ggplot(aes(x=reorder(class,hwy,FUN = median), hwy)) +
geom_boxplot()+
coord_flip()
* In the plot above I have used the reorder() function, which will be covered soon in the Exploratory Data Analysis chapter.
Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y position act independently to find the location of each point. There are a number of other coordinate systems that are occasionally helpful:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
• coord_flip() switches the x- and y-axes. This is useful (for example) if you want horizontal boxplots. It’s also useful for long labels, which are hard to fit on the x-axis without overlapping:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()+
coord_flip()
diamonds %>%
ggplot(aes(cut, fill = cut))+
geom_bar(show.legend = FALSE, width = 1)+
# a plot has a single coordinate system, so only coord_polar() is added here
coord_polar()+
labs(y=NULL,x=NULL)
### Exercises 4. What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
library(tidyverse)
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.6.3
filter() function
Finding all the flights that departed on 1 January
jan1<-flights %>%
filter(month==1, day ==1)
Finding all the flights that departed in November or December 2013
# one way is to use the following code
flights %>%
filter(month %in% c(11,12)) %>%
View()
# or use the following one:
flights %>%
filter (month ==11 | month ==12)
## # A tibble: 55,403 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 11 1 5 2359 6 352 345
## 2 2013 11 1 35 2250 105 123 2356
## 3 2013 11 1 455 500 -5 641 651
## 4 2013 11 1 539 545 -6 856 827
## 5 2013 11 1 542 545 -3 831 855
## 6 2013 11 1 549 600 -11 912 923
## 7 2013 11 1 550 600 -10 705 659
## 8 2013 11 1 554 600 -6 659 701
## 9 2013 11 1 554 600 -6 826 827
## 10 2013 11 1 554 600 -6 749 751
## # ... with 55,393 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
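The two filters above select the same rows. The sketch below checks this on a small hypothetical tibble (months is not part of the flights data) and also shows a common mistake with the | operator.

```r
library(dplyr)

# hypothetical data: one row per month number
months <- tibble(month = 1:12)

with_in <- months %>% filter(month %in% c(11, 12))
with_or <- months %>% filter(month == 11 | month == 12)

# both approaches keep the same rows
identical(with_in$month, with_or$month)

# common mistake: `month == 11 | 12` does not test for month 12;
# the bare 12 is coerced to TRUE, so every row passes the filter
wrong <- months %>% filter(month == 11 | 12)
nrow(wrong)
```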
Finding all the flights that were delayed by two hours or more on both departure and arrival
flights %>%
filter(arr_delay >= 120 & dep_delay >= 120)
## # A tibble: 8,482 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 848 1835 853 1001 1950
## 2 2013 1 1 957 733 144 1056 853
## 3 2013 1 1 1114 900 134 1447 1222
## 4 2013 1 1 1815 1325 290 2120 1542
## 5 2013 1 1 1842 1422 260 1958 1535
## 6 2013 1 1 1856 1645 131 2212 2005
## 7 2013 1 1 1934 1725 129 2126 1855
## 8 2013 1 1 1938 1703 155 2109 1823
## 9 2013 1 1 1942 1705 157 2124 1830
## 10 2013 1 1 2006 1630 216 2230 1848
## # ... with 8,472 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Finding the flights that were not delayed by two hours or more on either departure or arrival
flights %>%
filter(!(arr_delay >= 120 | dep_delay >= 120)) %>%
View()
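By De Morgan's law, !(x | y) is the same as !x & !y, so the filter above can also be written with two "less than" conditions. The sketch below checks this on a small hypothetical tibble (the values are made up for illustration), including rows with missing delays.

```r
library(dplyr)

# hypothetical delays in minutes, with some missing values
delays <- tibble(
  arr_delay = c(130, 10, 150, NA, 5),
  dep_delay = c(125, 5, 20, 130, NA)
)

not_or     <- delays %>% filter(!(arr_delay >= 120 | dep_delay >= 120))
both_under <- delays %>% filter(arr_delay < 120, dep_delay < 120)

# both versions keep the same rows; rows where the condition is NA are dropped
identical(not_or$arr_delay, both_under$arr_delay)
identical(not_or$dep_delay, both_under$dep_delay)
```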
# a. Had an arrival delay of two or more hours
flights %>%
filter(arr_delay >= 120)
## # A tibble: 10,200 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 811 630 101 1047 830
## 2 2013 1 1 848 1835 853 1001 1950
## 3 2013 1 1 957 733 144 1056 853
## 4 2013 1 1 1114 900 134 1447 1222
## 5 2013 1 1 1505 1310 115 1638 1431
## 6 2013 1 1 1525 1340 105 1831 1626
## 7 2013 1 1 1549 1445 64 1912 1656
## 8 2013 1 1 1558 1359 119 1718 1515
## 9 2013 1 1 1732 1630 62 2028 1825
## 10 2013 1 1 1803 1620 103 2008 1750
## # ... with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# b. Flew to Houston (IAH or HOU)
flights %>%
filter(dest == "HOU" | dest == "IAH")
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# c. Were operated by United, American, or Delta
flights %>%
filter(carrier %in% c("AA","UA","DL"))
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # ... with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# d. Departed in summer (July, August, and September)
flights %>%
filter(month %in% c(7,8,9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# e. Arrived more than two hours late, but didn’t leave late
flights %>%
filter(arr_delay>120 & dep_delay <=0)
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 27 1419 1420 -1 1754 1550
## 2 2013 10 7 1350 1350 0 1736 1526
## 3 2013 10 7 1357 1359 -2 1858 1654
## 4 2013 10 16 657 700 -3 1258 1056
## 5 2013 11 1 658 700 -2 1329 1015
## 6 2013 3 18 1844 1847 -3 39 2219
## 7 2013 4 17 1635 1640 -5 2049 1845
## 8 2013 4 18 558 600 -2 1149 850
## 9 2013 4 18 655 700 -5 1213 950
## 10 2013 5 22 1827 1830 -3 2217 2010
## # ... with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# f. Were delayed by at least an hour, but made up over 30 minutes in flight
# time made up in flight = dep_delay - arr_delay
flights %>%
filter(dep_delay >= 60, dep_delay - arr_delay > 30)
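The idea of "making up time in flight" can be illustrated on a few hypothetical flights: the time made up is the departure delay minus the arrival delay. The tibble toy and the column made_up below are assumptions for illustration only, and mutate() is used just to make the intermediate value visible.

```r
library(dplyr)

# hypothetical delays in minutes
toy <- tibble(
  dep_delay = c(90, 75, 60, 30),
  arr_delay = c(50, 70, 29, -10)
)

caught_up <- toy %>%
  mutate(made_up = dep_delay - arr_delay) %>%   # minutes recovered in the air
  filter(dep_delay >= 60, made_up > 30)         # late by an hour or more, recovered over 30 minutes

caught_up
```

Only the first and third flights remain: the second recovered just 5 minutes, and the fourth left less than an hour late.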
Below is an example of how to use between() to filter the flights that departed in summer (July, August, and September):
flights %>%
filter(between(month,7,9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
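between() includes both end points, so between(month, 7, 9) is shorthand for month >= 7 & month <= 9. A minimal check on a plain vector (the values are chosen only for illustration):

```r
library(dplyr)

month <- c(6, 7, 8, 9, 10)

in_summer  <- between(month, 7, 9)       # FALSE TRUE TRUE TRUE FALSE
long_form  <- month >= 7 & month <= 9    # the same test written out

identical(in_summer, long_form)
```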
flights %>%
filter(is.na(dep_time))
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 NA 1630 NA NA 1815
## 2 2013 1 1 NA 1935 NA NA 2240
## 3 2013 1 1 NA 1500 NA NA 1825
## 4 2013 1 1 NA 600 NA NA 901
## 5 2013 1 2 NA 1540 NA NA 1747
## 6 2013 1 2 NA 1620 NA NA 1746
## 7 2013 1 2 NA 1355 NA NA 1459
## 8 2013 1 2 NA 1420 NA NA 1644
## 9 2013 1 2 NA 1321 NA NA 1536
## 10 2013 1 2 NA 1545 NA NA 1910
## # ... with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange()
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
# arrange the flights in descending order of month, day, and dep_time
flights %>%
arrange(-month,-day, -dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 31 2356 2359 -3 436 445
## 2 2013 12 31 2355 2359 -4 430 440
## 3 2013 12 31 2332 2245 47 58 3
## 4 2013 12 31 2328 2330 -2 412 409
## 5 2013 12 31 2321 2250 31 46 8
## 6 2013 12 31 2310 2255 15 7 2356
## 7 2013 12 31 2245 2250 -5 2359 2356
## 8 2013 12 31 2235 2245 -10 2351 2355
## 9 2013 12 31 2218 2219 -1 315 304
## 10 2013 12 31 2211 2159 12 100 45
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
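The tie-breaking behaviour is easier to see on a tiny example. The tibble ties below is hypothetical; the second column only matters for rows that have the same value in the first.

```r
library(dplyr)

# hypothetical data with repeated month values
ties <- tibble(
  month = c(2, 1, 2, 1),
  day   = c(5, 9, 7, 3)
)

# month decides the order first; desc(day) only breaks ties within a month
by_both <- arrange(ties, month, desc(day))
by_both
```

Here the two month 1 rows come first (days 9, then 3), followed by the two month 2 rows (days 7, then 5).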
# arrange the flights by descending order of their dep_time
flights %>%
arrange(-dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 30 2400 2359 1 327 337
## 2 2013 11 27 2400 2359 1 515 445
## 3 2013 12 5 2400 2359 1 427 440
## 4 2013 12 9 2400 2359 1 432 440
## 5 2013 12 9 2400 2250 70 59 2356
## 6 2013 12 13 2400 2359 1 432 440
## 7 2013 12 19 2400 2359 1 434 440
## 8 2013 12 29 2400 1700 420 302 2025
## 9 2013 2 7 2400 2359 1 432 436
## 10 2013 2 7 2400 2359 1 443 444
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# or you can also use desc() to arrange by descending order
flights %>%
arrange(desc(dep_time))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 30 2400 2359 1 327 337
## 2 2013 11 27 2400 2359 1 515 445
## 3 2013 12 5 2400 2359 1 427 440
## 4 2013 12 9 2400 2359 1 432 440
## 5 2013 12 9 2400 2250 70 59 2356
## 6 2013 12 13 2400 2359 1 432 440
## 7 2013 12 19 2400 2359 1 434 440
## 8 2013 12 29 2400 1700 420 302 2025
## 9 2013 2 7 2400 2359 1 432 436
## 10 2013 2 7 2400 2359 1 443 444
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
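For numeric columns the minus sign and desc() give the same order, but only desc() also works on character columns. A small sketch on a hypothetical tibble of carrier codes:

```r
library(dplyr)

# hypothetical data: a character column
carriers <- tibble(carrier = c("AA", "UA", "DL"))

# desc() works on character columns
by_desc <- arrange(carriers, desc(carrier))
by_desc$carrier   # "UA" "DL" "AA"

# unary minus is not defined for character vectors, so this call fails
minus_attempt <- try(arrange(carriers, -carrier), silent = TRUE)
inherits(minus_attempt, "try-error")
```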
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
arrange(df,desc(x))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 5
## 2 2
## 3 NA
How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
## arrange the flights in descending order of dep_time, with missing values first
flights %>%
arrange(desc(is.na(dep_time)),-dep_time) %>%
View()
## arrange the variable x by starting with missing values
arrange(df,desc(is.na(x)))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 NA
## 2 5
## 3 2
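This works because is.na() returns a logical vector, and FALSE sorts before TRUE; wrapping it in desc() therefore puts the TRUE (missing) rows first. Adding a second sort key then orders the non-missing values. A sketch on a hypothetical tibble vals with the same values as df:

```r
library(dplyr)

vals <- tibble(x = c(5, 2, NA))

is.na(vals$x)   # FALSE FALSE TRUE

# missing values first, then the remaining values in ascending order
na_then_asc <- arrange(vals, desc(is.na(x)), x)
na_then_asc$x    # NA 2 5

# missing values first, then the remaining values in descending order
na_then_desc <- arrange(vals, desc(is.na(x)), desc(x))
na_then_desc$x   # NA 5 2
```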
# the most delayed flights are sorted to the top as follows
flights %>%
arrange(-dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# or as the following
flights %>%
arrange(desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The most delayed flight was HA 51, JFK to HNL, which was scheduled to leave on January 09, 2013 09:00. Note that the departure time is given as 641, which seems to be less than the scheduled departure time. But the departure was delayed 1,301 minutes, which is 21 hours, 41 minutes. The departure time is the day after the scheduled departure time.
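The arithmetic behind this reading can be checked directly. The sketch below uses only the values quoted above (scheduled departure 900, delay of 1,301 minutes); the variable names are illustrative.

```r
sched_dep <- 900    # scheduled departure in HHMM format (09:00)
delay_min <- 1301   # departure delay in minutes

# split the delay into hours and minutes
delay_hours <- delay_min %/% 60   # 21
delay_rest  <- delay_min %% 60    # 41

# convert the scheduled time to minutes since midnight and add the delay
sched_min  <- (sched_dep %/% 100) * 60 + sched_dep %% 100   # 540
actual_min <- sched_min + delay_min                          # 1841

days_later <- actual_min %/% (24 * 60)   # 1, i.e. the next day
clock_min  <- actual_min %% (24 * 60)    # 401 minutes after midnight
dep_clock  <- (clock_min %/% 60) * 100 + clock_min %% 60     # 641
```

The result, 641 on the following day, matches the dep_time recorded for this flight.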
The flights that left furthest ahead of schedule are arranged as follows:
## sort by ascending dep_delay, so the most negative delays come first
flights %>%
arrange(dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Flight B6 97 (JFK to DEN) scheduled to depart on December 07, 2013 at 21:23 departed 43 minutes early.
# for the first scenario (shortest air_time), the simplest solution is the following
flights %>%
arrange(air_time) %>%
View()
The fastest flight by this measure took only 20 minutes of air_time: EV 4368 from EWR to BDL. But it covered a distance of only 116 miles, so this does not mean that it was the fastest in terms of speed (distance covered per unit of time). A proper answer needs mutate(), which has not been covered so far.
# calculate each flight's speed, which is not available in the dataset, using mutate(); speed is measured in miles per minute
flights %>%
mutate(speed = distance/air_time)
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # speed <dbl>
# then sort by descending speed to find the flight that covered its distance in the least air_time
flights %>%
mutate(speed = distance/air_time) %>%
arrange(desc(speed))%>%
View()
Finally, the proper answer is the following: flight DL 1499 from LGA to ATL (departed on 25 May 2013 at 17:09) was the fastest flight. It covered 762 miles in only 1 hour and 5 minutes of air time, which is about 11.7 miles per minute.
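As a quick check of that figure, the speed can be recomputed from the two values quoted above. The conversion to miles per hour is added here only for readability and is not part of the original calculation.

```r
distance <- 762   # miles
air_time <- 65    # minutes (1 hour and 5 minutes)

speed_per_min <- distance / air_time    # about 11.7 miles per minute
speed_mph     <- speed_per_min * 60     # about 703 miles per hour
```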
select()