Introduction and Data

We will be analyzing airline flight delays. Note that this Rmd file will not knit in its current form. It is necessary to complete some of the below items for it to knit successfully.

Download

The first step in this assignment will be downloading the flight data from the Bureau of Transportation Statistics

Here are some instructions (this will take some time, make sure one is done downloading before starting the next one. It took me about 17 minutes to casually do this while I did other things):

Set “Filter Year” to 2023
Set “Filter Period” to January
Select the following fields:
- Month
- DayofMonth
- DayofWeek
- FlightDate
- Reporting_Airline
- DOT_ID_Reporting_Airline
- OriginAirportID
- OriginCityMarketID
- Origin
- OriginCityName
- OriginState
- DestAirportID
- DestCityMarketID
- Dest
- DestCityName
- DestState
- DepDelay
- DepDel15
- ArrDelay
- ArrDel15
- Cancelled
- CancellationCode
Click “Download”
Change the “Filter Period” to February and Click “Download”
Change the “Filter Period” to March and Click “Download”
…
Change the “Filter Period” to December and Click “Download”

Uncompress and rename

Now, go to wherever your operating system defaults to downloading, more than likely your ‘download’ directory. There will be 12 .zip files there. Uncompress each of these files.

Move the files

Make a subdirectory of your R working directory (the working directory is where your .Rmd file is located) called “database”. I suggest making a folder homework1 or some such, and making database a subdirectory.

Now, we want to run the following code to generate an R object that is a single flat file containing all the flights for that year.

Let’s load in some packages.

require(tidyverse)

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Problem 0 (20 pts)

Now, we want to take the data and make a data set. First, let’s get the file names from the database directory

paths = dir("database",pattern="T_ONTIME_REPORTING",full.names=TRUE)
print(paths)

##  [1] "database/T_ONTIME_REPORTING_APR.csv"   
##  [2] "database/T_ONTIME_REPORTING_AUGUST.csv"
##  [3] "database/T_ONTIME_REPORTING_DEC.csv"   
##  [4] "database/T_ONTIME_REPORTING_FEB.csv"   
##  [5] "database/T_ONTIME_REPORTING_Jan.csv"   
##  [6] "database/T_ONTIME_REPORTING_JULY.csv"  
##  [7] "database/T_ONTIME_REPORTING_june.csv"  
##  [8] "database/T_ONTIME_REPORTING_Mar.csv"   
##  [9] "database/T_ONTIME_REPORTING_May.csv"   
## [10] "database/T_ONTIME_REPORTING_Nov.csv"   
## [11] "database/T_ONTIME_REPORTING_Oct.csv"   
## [12] "database/T_ONTIME_REPORTING_Sept.csv"

names(paths) = basename(paths)

Answer 0.1

What does ‘basename’ do here?

basename name is a function that obtains or extracts the file name by removing directory part from the path. From the above code, ‘paths’ variable has all 12 files relative path, with basename function it just obtains the file name.

Creating the data set

This next step can take a few minutes. Let’s time it! Accurately timing code is extremely delicate and controversial because it depends on all the settings on your computer, what other programs you have running, architecture (do you have multiple cores/processors/…) even how long it has been since you rebooted!

But, a simple and useful way is via `proc.time’, which records the computer time of your processor. Let’s read in each .csv file into an R data file that has all of the flights.

(note: we will use ‘cache = TRUE’ so that R studio won’t re-compile this section every time you knit. It will detect any changes to this `chunk’ and re-run if there are any. However, it doesn’t detect if there are changes in another chunk that affects this chunk. For that, we need ‘dependson’. In this case, a change to the directory database would affect ‘df’. So, we want to add ‘dependson’ pointed to that chunk. Note that each dependson chunk must be cached as well)

srt    = proc.time()[3]
dfList = lapply(paths,read.csv)
df     = do.call(rbind,dfList)
end    = proc.time()[3]

df$X = NULL #get rid of that empty column
save(df, file="2022flights.Rdata")

totalTime = end-srt #### Answer 0.2 ## Overwrite the totalTime object so that it has the correct elapsed time

Answer: This process took 32.678 seconds.

Note That using purely R code, we can report the total time as well. This would be useful if you weren’t using R markdown:

cat('total time: ',end-srt)

## total time:  32.678

The data set

Let’s look at some attributes of this data set

dataSetSize = format(object.size(df),units='MiB')

Answer 0.3

The total size of the flights data set for that year is 1618.5 MiB

(Use inline code as the previous answer (don’t just ‘hard code’ the value))

Also, it is a good opportunity to make sure you have the complete data set. For instance, check to make sure you have the correct number of months: I have 4, 8, 12, 2, 1, 7, 6, 3, 5, 11, 10, 9 months out of the expected 12 total months.

Problem 1 (20 pts)

Some of the flights were cancelled and hence never flew. These are recorded via the ‘cancelled’ feature. How many total flights are there in the data set? How many were cancelled?

Additionally, there are some missing values, coded as NA, in the data set. Generally, we would want to impute these missing values. However, we will return to imputation later in the semester. For now, let’s remove the rows (flights) with missing values. Make a new object called flightsNotCancelled that have the flights that were not cancelled and have no missing values. Make sure to use this for the rest of the assignment. Lastly, how many flights do we have remaining in flightsNotCancelled?

totalNum  = nrow(df)
cancelNum =  df %>%
  group_by(CANCELLED) %>% #### Answer 1.1 ## Insert the correct group_by statement here
  summarise(n = n())

#I just want the integer, not the tibble:
cancelNum = as.integer(cancelNum[2,2])

flightsNotCancelled = df %>%
  filter(CANCELLED == 0.00) %>% #### Answer 1.2  Insert the correct filter statement here
  na.omit(.)

Answer 1.3

make sure you complete this answer dynamically (i.e. don’t hardcode in the answer, use object names)

The total number of flights is 6847899 . The total number of cancelled flights is 87943 . The total number of flights remaining after removing NAs is 6743403.

Problem 2 (15 pts)

How many flights were there for each day of the week?

flightsPerDay = flightsNotCancelled %>% 
  group_by(DAY_OF_WEEK) %>% 
  summarize(total = n())

flightsPerDay

## # A tibble: 7 × 2
##   DAY_OF_WEEK   total
##         <int>   <int>
## 1           1  996633
## 2           2  937570
## 3           3  951327
## 4           4  998363
## 5           5 1003622
## 6           6  870955
## 7           7  984933

flightsPerDay = flightsNotCancelled %>% 
  group_by(DAY_OF_WEEK) %>% 
  summarize(total = n()) %>%
  mutate(proportion = total / sum(total))

flightsPerDay

## # A tibble: 7 × 3
##   DAY_OF_WEEK   total proportion
##         <int>   <int>      <dbl>
## 1           1  996633      0.148
## 2           2  937570      0.139
## 3           3  951327      0.141
## 4           4  998363      0.148
## 5           5 1003622      0.149
## 6           6  870955      0.129
## 7           7  984933      0.146

ggplot(data = flightsPerDay) + geom_point(aes(x = DAY_OF_WEEK, y = proportion), colour = "blue")

#### Answer 2.1 ## Alter this plot so that the points are blue and the plot is of the proportion of flights each day instead of total

Problem 3 (20 pts)

Which month has the greatest proportion of late flights?

props = flightsNotCancelled %>%
  group_by(MONTH) %>%
  mutate(delayInd = ifelse(DEP_DELAY > 0,1,0)) %>% 
  #### Answer 3.1 Fill in the missing part
  summarize(total = n(),success = sum(delayInd),prop = mean(delayInd)) %>%
  filter(prop == max(prop))

props

## # A tibble: 1 × 4
##   MONTH  total success  prop
##   <int>  <int>   <dbl> <dbl>
## 1     7 585058  276357 0.472

Answer 3.2

The month with the greatest proportion of late flights is 7

Problem 4 (25 pts)

Which departure and arrival airport combination is associated with the worst median departure delay? Make sure the airport combination has at least 1000 recorded flights.

You’re a bit on your own here. Use the ideas from class and from this homework to answer this question. Be sure to take a look at the nature of the delay feature and see how it represents early departures. You can think of delays in two ways: with early departures (that is, negative delay) or without early departures. Explore this question a bit and see if your answer changes depending on the definition.

# filter to have atleast 1000 records
airport_combo = flightsNotCancelled %>%
  group_by(ORIGIN, DEST) %>%
  summarise(nDep = n(), DEP_DELAY, ORIGIN, DEST, ORIGIN_CITY_NAME, DEST_CITY_NAME, .groups = "drop") %>%
  filter(nDep >= 1000)

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

###### WAY 01 : Entirely Removing Early Departures (removing negative dep delay rows)
# filter to without considering early departures
# approach : entirely removed the rows with negative delay
airport_combo_without_early_dep = airport_combo %>%
  group_by(ORIGIN, DEST) %>%
  filter(DEP_DELAY >= 0)

# applying median departure delay
median_dep_delay = airport_combo_without_early_dep %>%
  group_by(ORIGIN, DEST) %>%
  summarise(depDelayMedian = median(DEP_DELAY), ORIGIN, DEST, ORIGIN_CITY_NAME, DEST_CITY_NAME, .groups = "drop")

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Sorting based on median to obtain median delay
worst_airport_combo = median_dep_delay %>%
  arrange(desc(depDelayMedian)) %>%
  slice(1)

worst_airport_combo

## # A tibble: 1 × 5
##   ORIGIN DEST  depDelayMedian ORIGIN_CITY_NAME DEST_CITY_NAME       
##   <chr>  <chr>          <dbl> <chr>            <chr>                
## 1 TXK    DFW             42.5 Texarkana, AR    Dallas/Fort Worth, TX

###### WAY 02 : Removing Early Departures by considering negative values as 0
# airport_combo_zeoring_neg_delay = airport_combo %>%
#   group_by(ORIGIN, DEST) %>%
#   mutate(recordDep = ifelse(DEP_DELAY > 0,DEP_DELAY,0)) %>%
#   summarise(nDep = n(), delaySumDep = median(recordDep), ORIGIN_CITY_NAME, DEST_CITY_NAME, .groups = "drop")
# 
# worst_airport_combo = airport_combo_zeoring_neg_delay %>%
#   arrange(desc(delaySumDep)) %>%
#   slice(1)
#   
# worst_airport_combo

Answer 4.1

The worst departure/arrival airport combination in terms of median delay without including early departures is TXK - DFW

Problem 4.2

What are these two airports? Don’t answer using the airport number or 3 letter code. Instead, answer using the city/state or country of the airport.

Answer 4.2

Those two airports are : Texarkana, AR and Dallas/Fort Worth, TX

STAT656: Homework 1

Analyzing Airline Flight Delays.