this assignment was completed without code from the solution
walkthrough, which means it was conducted using code by my version. so
the process to reach the results might be slightly different from the
solution code by Markho(the instructor).
the code documentation are in my github
load data & library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df <- hflights::hflights
class(df)
## [1] "data.frame"
paste("number of rows:", nrow(df)); paste("number of columns:", ncol(df))
## [1] "number of rows: 227496"
## [1] "number of columns: 21"
df %>% count(UniqueCarrier)
## UniqueCarrier n
## 1 AA 3244
## 2 AS 365
## 3 B6 695
## 4 CO 70032
## 5 DL 2641
## 6 EV 2204
## 7 F9 838
## 8 FL 2139
## 9 MQ 4648
## 10 OO 16061
## 11 UA 2072
## 12 US 4082
## 13 WN 45343
## 14 XE 73053
## 15 YV 79
another way that only display the distinct carriers name
df %>% distinct(UniqueCarrier)
## UniqueCarrier
## 1 AA
## 2 AS
## 3 B6
## 4 CO
## 5 DL
## 6 OO
## 7 UA
## 8 US
## 9 WN
## 10 EV
## 11 F9
## 12 FL
## 13 MQ
## 14 XE
## 15 YV
paste("there are about", nrow(df %>% distinct(Dest)), "dest airport are involved") ; paste("there are about", nrow(df %>% distinct(Origin)), "origin airport are involved")
## [1] "there are about 116 dest airport are involved"
## [1] "there are about 2 origin airport are involved"
df %>% count(Cancelled) %>% filter(Cancelled == 1)
## Cancelled n
## 1 1 2973
df %>% mutate(status = case_when(Cancelled == 1 ~ "Cancelled",
TRUE ~ "Success")) %>% count(status)
## status n
## 1 Cancelled 2973
## 2 Success 224523
First, produce a table where statistics for each carrier is
shown:
number of flights per carrier
total distance flown in miles per carrier
total actual elapsed time in hours per carrier
total air time in hours per carrier
mean distance per flight for each carrier
mean actual elapsed time in hours per flight for each carrier
mean air time in hours per flight for each carrier
dfcarrier <- df %>% group_by(UniqueCarrier) %>% summarise(numflight = n(),
totdist = sum(Distance),
elaptimehr = sum(ActualElapsedTime, na.rm = T)/60,
airtimehr = sum(AirTime, na.rm = T)/60,
`mean dist/flight` = totdist/numflight,
`mean elaptimehr/flight` = elaptimehr/numflight,
`mean airtimehr/flight` = airtimehr/numflight
) %>%
ungroup()
Second, calculate the percentage of total distance flown by top 3
performing carriers VS total distance flown by remaining carriers.
Execute steps:
first rank carriers by total distance flown
top 3 performers are in one group, remaining carriers are in second
group
for each group calculate total distance flown
for each group calculate %
dfcarrier %>% arrange(desc(totdist)) %>% mutate(rank = row_number(),
group = case_when(rank %in% 1:3 ~ "top3",
T ~ "remaining")) %>%
group_by(group) %>% summarise(tdgroup = sum(totdist)) %>%
mutate(percentage = tdgroup/sum(tdgroup)*100) %>%
select(group, `percent of total distance` = percentage) %>%
arrange(desc(`percent of total distance`)) %>%
ungroup()
## # A tibble: 2 × 2
## group `percent of total distance`
## <chr> <dbl>
## 1 top3 82.3
## 2 remaining 17.7
Modify your main flights table: create date column by uniting columns: year, month, day of month when uniting columns do not lose source columns (mutate each column - with slightly different name, before unite operation is executed) you will need to parse date column after unite operation also you should add leading zeros to month and day of month column before date is created create columns: quarter, week
dfmod <- df %>% mutate_at(.vars = c("Month", "DayofMonth"), .funs = str_pad,
width = 2, pad = 0, side = "left") %>%
unite(col = "Date" ,c(Year, Month, DayofMonth), sep = "-", remove = F) %>%
#because we apply str_pad function on Month and DayofMonth, these columns format
#changed to character, thus we convert it back to integer to simplify the logical code
mutate(Quarter = case_when(as.integer(Month) %in% 1:3 ~ "Q1",
as.integer(Month) %in% 4:6 ~ "Q2",
as.integer(Month) %in% 7:9 ~ "Q3",
as.integer(Month) %in% 10:12 ~ "Q4"),
WeekofMonth = case_when(as.integer(DayofMonth) %in% 1:7 ~ "W1",
as.integer(DayofMonth) %in% 8:14 ~ "W2",
as.integer(DayofMonth) %in% 15:21 ~ "W3",
as.integer(DayofMonth) %in% 22:28 ~ "W4",
TRUE ~ "W5"))
convert date column to appropriate format using ymd function in lubridate packages
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
dfmod$Date <- ymd(dfmod$Date)
the easiest way to create Date column from Year, Month, and DayOfMonth column while convert it to date format is using lubridate
dfsimp <- df %>% mutate(Date = make_date(year = Year, month = Month, day = DayofMonth),
# add column quarter and week
Quarter = case_when(Month %in% 1:3 ~ "Q1",
Month %in% 4:6 ~ "Q2",
Month %in% 7:9 ~ "Q3",
Month %in% 10:12 ~ "Q4"),
WeekofMonth = case_when(DayofMonth %in% 1:7 ~ "W1",
DayofMonth %in% 8:14 ~ "W2",
DayofMonth %in% 15:21 ~ "W3",
DayofMonth %in% 22:28 ~ "W4",
TRUE ~ "W5"))
Using your modified table try to answer the given questions: Is total number of flights increasing or decreasing quarterly? Is total distance increasing or decreasing monthly?
dfsimp %>% select(Date, Quarter, WeekofMonth) %>% group_by(Quarter) %>%
summarise(`Total Flight` = n()) %>%
ggplot() + geom_point(aes(x = Quarter, y = `Total Flight`))
result : total number of flight was increasing until 3rd quarter, but significantly decreasing after 3rd quarter
dfsimp %>% select(Date, Distance) %>% group_by(Month = month(Date, label = T)) %>%
summarise(`Total Dist` = sum(Distance)) %>%
ggplot() + geom_point(aes(x = Month, y = `Total Dist`))
result : total number of distance was unable to determine whether increasing or decreasing monthly, because the pattern showed an unstable trend
The idea for the last exercise is another data wrangling task, where you will have to use technique called “pivoting”. Build a table, that will resemble a heat map by: for each carrier and month, calculate total number of flights then normalize total number of flights (divide each value with maximum total number of flights, you must get values between 0 and 1!) now pivot your table from long to wide format so each row is represented with carrier, and each column is represented with month, normalized total number of flights are values in table cells
# total number of flight for each carrier and month
carrmo <- dfsimp %>% mutate(Bulan = month(Date, label = T)) %>%
select(UniqueCarrier, Bulan) %>% count(UniqueCarrier, Bulan) %>%
mutate(normtotal = n/max(n)) %>% select(UniqueCarrier, Bulan, normtotal) %>%
pivot_wider(names_from = Bulan, values_from = normtotal)
carrmo %>% pivot_longer(cols = c(-"UniqueCarrier"), names_to = "Month", values_to = "Num of Flight") %>%
ggplot() + geom_tile(aes(x = Month, y = UniqueCarrier, fill = `Num of Flight`))