this assignment was completed without code from the solution walkthrough, which means it was conducted using code by my version. so the process to reach the results might be slightly different from the solution code by Markho(the instructor).
the code documentation are in my github

Exercise 1

load data & library

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

df <- hflights::hflights
class(df)

## [1] "data.frame"

How many rows and columns are in table hflights?

paste("number of rows:", nrow(df)); paste("number of columns:", ncol(df))

## [1] "number of rows: 227496"

## [1] "number of columns: 21"

How many different carriers are listed in the table (print a table with distinct carrier names)?

df %>% count(UniqueCarrier)

##    UniqueCarrier     n
## 1             AA  3244
## 2             AS   365
## 3             B6   695
## 4             CO 70032
## 5             DL  2641
## 6             EV  2204
## 7             F9   838
## 8             FL  2139
## 9             MQ  4648
## 10            OO 16061
## 11            UA  2072
## 12            US  4082
## 13            WN 45343
## 14            XE 73053
## 15            YV    79

another way that only display the distinct carriers name

df %>% distinct(UniqueCarrier)

##    UniqueCarrier
## 1             AA
## 2             AS
## 3             B6
## 4             CO
## 5             DL
## 6             OO
## 7             UA
## 8             US
## 9             WN
## 10            EV
## 11            F9
## 12            FL
## 13            MQ
## 14            XE
## 15            YV

Which and how many airports were involved? Consider both origin and destination airports!

paste("there are about", nrow(df %>% distinct(Dest)), "dest airport are involved") ; paste("there are about", nrow(df %>% distinct(Origin)), "origin airport are involved")

## [1] "there are about 116 dest airport are involved"

## [1] "there are about 2 origin airport are involved"

How many flights were cancelled?

df %>% count(Cancelled) %>% filter(Cancelled == 1)

##   Cancelled    n
## 1         1 2973

df %>% mutate(status = case_when(Cancelled == 1 ~ "Cancelled",
                                 TRUE ~ "Success")) %>% count(status)

##      status      n
## 1 Cancelled   2973
## 2   Success 224523

Exercise 2

First, produce a table where statistics for each carrier is shown:
number of flights per carrier
total distance flown in miles per carrier
total actual elapsed time in hours per carrier
total air time in hours per carrier
mean distance per flight for each carrier
mean actual elapsed time in hours per flight for each carrier
mean air time in hours per flight for each carrier

dfcarrier <- df %>% group_by(UniqueCarrier) %>% summarise(numflight = n(),
                                             totdist = sum(Distance),
                                             elaptimehr = sum(ActualElapsedTime, na.rm = T)/60,
                                             airtimehr = sum(AirTime, na.rm = T)/60,
                                             `mean dist/flight` = totdist/numflight,
                                             `mean elaptimehr/flight` = elaptimehr/numflight,
                                             `mean airtimehr/flight` = airtimehr/numflight
                                             ) %>% 
        ungroup()

Second, calculate the percentage of total distance flown by top 3 performing carriers VS total distance flown by remaining carriers. Execute steps:
first rank carriers by total distance flown
top 3 performers are in one group, remaining carriers are in second group
for each group calculate total distance flown
for each group calculate %

dfcarrier %>% arrange(desc(totdist)) %>% mutate(rank = row_number(),
                                                group = case_when(rank %in% 1:3 ~ "top3",
                                                                  T ~ "remaining")) %>% 
        group_by(group) %>% summarise(tdgroup = sum(totdist)) %>%
        mutate(percentage = tdgroup/sum(tdgroup)*100) %>% 
        select(group, `percent of total distance` = percentage) %>%
        arrange(desc(`percent of total distance`)) %>% 
        ungroup()

## # A tibble: 2 × 2
##   group     `percent of total distance`
##   <chr>                           <dbl>
## 1 top3                             82.3
## 2 remaining                        17.7

Exercise 3

Modify your main flights table: create date column by uniting columns: year, month, day of month when uniting columns do not lose source columns (mutate each column - with slightly different name, before unite operation is executed) you will need to parse date column after unite operation also you should add leading zeros to month and day of month column before date is created create columns: quarter, week

dfmod <- df %>% mutate_at(.vars = c("Month", "DayofMonth"), .funs = str_pad,
                 width = 2, pad = 0, side = "left") %>% 
        unite(col = "Date" ,c(Year, Month, DayofMonth), sep = "-", remove = F) %>% 
        #because we apply str_pad function on Month and DayofMonth, these columns format
        #changed to character, thus we convert it back to integer to simplify the logical code
        mutate(Quarter = case_when(as.integer(Month) %in% 1:3 ~ "Q1",
                                   as.integer(Month) %in% 4:6 ~ "Q2",
                                   as.integer(Month) %in% 7:9 ~ "Q3",
                                   as.integer(Month) %in% 10:12 ~ "Q4"),
               WeekofMonth = case_when(as.integer(DayofMonth) %in% 1:7 ~ "W1",
                                as.integer(DayofMonth) %in% 8:14 ~ "W2",
                                as.integer(DayofMonth) %in% 15:21 ~ "W3",
                                as.integer(DayofMonth) %in% 22:28 ~ "W4",
                                TRUE ~ "W5"))

convert date column to appropriate format using ymd function in lubridate packages

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

dfmod$Date <- ymd(dfmod$Date)

the easiest way to create Date column from Year, Month, and DayOfMonth column while convert it to date format is using lubridate

dfsimp <- df %>% mutate(Date = make_date(year = Year, month = Month, day = DayofMonth),
              # add column quarter and week
              Quarter = case_when(Month %in% 1:3 ~ "Q1",
                                  Month %in% 4:6 ~ "Q2",
                                  Month %in% 7:9 ~ "Q3",
                                  Month %in% 10:12 ~ "Q4"),
              WeekofMonth = case_when(DayofMonth %in% 1:7 ~ "W1",
                                      DayofMonth %in% 8:14 ~ "W2",
                                      DayofMonth %in% 15:21 ~ "W3",
                                      DayofMonth %in% 22:28 ~ "W4",
                                      TRUE ~ "W5"))

Using your modified table try to answer the given questions: Is total number of flights increasing or decreasing quarterly? Is total distance increasing or decreasing monthly?

dfsimp %>% select(Date, Quarter, WeekofMonth) %>%  group_by(Quarter) %>% 
        summarise(`Total Flight` = n()) %>% 
ggplot() + geom_point(aes(x = Quarter, y = `Total Flight`))

result : total number of flight was increasing until 3rd quarter, but significantly decreasing after 3rd quarter

dfsimp %>% select(Date, Distance) %>% group_by(Month = month(Date, label = T)) %>%
        summarise(`Total Dist` = sum(Distance)) %>% 
        ggplot() + geom_point(aes(x = Month, y = `Total Dist`))

result : total number of distance was unable to determine whether increasing or decreasing monthly, because the pattern showed an unstable trend

Exercise 4

The idea for the last exercise is another data wrangling task, where you will have to use technique called “pivoting”. Build a table, that will resemble a heat map by: for each carrier and month, calculate total number of flights then normalize total number of flights (divide each value with maximum total number of flights, you must get values between 0 and 1!) now pivot your table from long to wide format so each row is represented with carrier, and each column is represented with month, normalized total number of flights are values in table cells

# total number of flight for each carrier and month
carrmo <- dfsimp %>% mutate(Bulan = month(Date, label = T)) %>%
        select(UniqueCarrier, Bulan) %>% count(UniqueCarrier, Bulan) %>% 
        mutate(normtotal = n/max(n)) %>% select(UniqueCarrier, Bulan, normtotal) %>% 
        pivot_wider(names_from = Bulan, values_from = normtotal)

carrmo %>% pivot_longer(cols = c(-"UniqueCarrier"), names_to = "Month", values_to = "Num of Flight") %>% 
        ggplot() + geom_tile(aes(x = Month, y = UniqueCarrier, fill = `Num of Flight`))

Section 1 Assignment for ‘Data Science with R: tidyverse’ by Markho Intihar at Udemy

Dzaky Jaya

2022-09-02