We will be analyzing airline flight delays. Note that this Rmd file will not knit in its current form. It is necessary to complete some of the below items for it to knit successfully.
The first step in this assignment will be downloading the flight data from the Bureau of Transportation Statistics
Here are some instructions (this will take some time, make sure one is done downloading before starting the next one. It took me about 17 minutes to casually do this while I did other things):
Now, go to wherever your operating system defaults to downloading, more than likely your ‘download’ directory. There will be 12 .zip files there. Uncompress each of these files.
Make a subdirectory of your R working directory (the working directory is where your .Rmd file is located) called “database”. I suggest making a folder homework1 or some such, and making database a subdirectory.
Now, we want to run the following code to generate an R object that is a single flat file containing all the flights for that year.
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Now, we want to take the data and make a data set. First, let’s get the file names from the database directory
paths = dir("database",pattern="T_ONTIME_REPORTING",full.names=TRUE)
print(paths)
## [1] "database/T_ONTIME_REPORTING_APR.csv"
## [2] "database/T_ONTIME_REPORTING_AUGUST.csv"
## [3] "database/T_ONTIME_REPORTING_DEC.csv"
## [4] "database/T_ONTIME_REPORTING_FEB.csv"
## [5] "database/T_ONTIME_REPORTING_Jan.csv"
## [6] "database/T_ONTIME_REPORTING_JULY.csv"
## [7] "database/T_ONTIME_REPORTING_june.csv"
## [8] "database/T_ONTIME_REPORTING_Mar.csv"
## [9] "database/T_ONTIME_REPORTING_May.csv"
## [10] "database/T_ONTIME_REPORTING_Nov.csv"
## [11] "database/T_ONTIME_REPORTING_Oct.csv"
## [12] "database/T_ONTIME_REPORTING_Sept.csv"
names(paths) = basename(paths)
What does ‘basename’ do here?
basename name is a function that obtains or extracts the file name by removing directory part from the path. From the above code, ‘paths’ variable has all 12 files relative path, with basename function it just obtains the file name.
This next step can take a few minutes. Let’s time it! Accurately timing code is extremely delicate and controversial because it depends on all the settings on your computer, what other programs you have running, architecture (do you have multiple cores/processors/…) even how long it has been since you rebooted!
But, a simple and useful way is via `proc.time’, which records the computer time of your processor. Let’s read in each .csv file into an R data file that has all of the flights.
(note: we will use ‘cache = TRUE’ so that R studio won’t re-compile this section every time you knit. It will detect any changes to this `chunk’ and re-run if there are any. However, it doesn’t detect if there are changes in another chunk that affects this chunk. For that, we need ‘dependson’. In this case, a change to the directory database would affect ‘df’. So, we want to add ‘dependson’ pointed to that chunk. Note that each dependson chunk must be cached as well)
srt = proc.time()[3]
dfList = lapply(paths,read.csv)
df = do.call(rbind,dfList)
end = proc.time()[3]
df$X = NULL #get rid of that empty column
save(df, file="2022flights.Rdata")
totalTime = end-srt #### Answer 0.2 ## Overwrite the totalTime object so that it has the correct elapsed time
Answer: This process took 32.678 seconds.
Note That using purely R code, we can report the total time as well. This would be useful if you weren’t using R markdown:
cat('total time: ',end-srt)
## total time: 32.678
Let’s look at some attributes of this data set
dataSetSize = format(object.size(df),units='MiB')
The total size of the flights data set for that year is 1618.5 MiB
(Use inline code as the previous answer (don’t just ‘hard code’ the value))
Also, it is a good opportunity to make sure you have the complete data set. For instance, check to make sure you have the correct number of months: I have 4, 8, 12, 2, 1, 7, 6, 3, 5, 11, 10, 9 months out of the expected 12 total months.
Some of the flights were cancelled and hence never flew. These are recorded via the ‘cancelled’ feature. How many total flights are there in the data set? How many were cancelled?
Additionally, there are some missing values, coded as NA, in the data set. Generally, we would want to impute these missing values. However, we will return to imputation later in the semester. For now, let’s remove the rows (flights) with missing values. Make a new object called flightsNotCancelled that have the flights that were not cancelled and have no missing values. Make sure to use this for the rest of the assignment. Lastly, how many flights do we have remaining in flightsNotCancelled?
totalNum = nrow(df)
cancelNum = df %>%
group_by(CANCELLED) %>% #### Answer 1.1 ## Insert the correct group_by statement here
summarise(n = n())
#I just want the integer, not the tibble:
cancelNum = as.integer(cancelNum[2,2])
flightsNotCancelled = df %>%
filter(CANCELLED == 0.00) %>% #### Answer 1.2 Insert the correct filter statement here
na.omit(.)
make sure you complete this answer dynamically (i.e. don’t hardcode in the answer, use object names)
The total number of flights is 6847899 . The total number of cancelled flights is 87943 . The total number of flights remaining after removing NAs is 6743403.
How many flights were there for each day of the week?
flightsPerDay = flightsNotCancelled %>%
group_by(DAY_OF_WEEK) %>%
summarize(total = n())
flightsPerDay
## # A tibble: 7 × 2
## DAY_OF_WEEK total
## <int> <int>
## 1 1 996633
## 2 2 937570
## 3 3 951327
## 4 4 998363
## 5 5 1003622
## 6 6 870955
## 7 7 984933
flightsPerDay = flightsNotCancelled %>%
group_by(DAY_OF_WEEK) %>%
summarize(total = n()) %>%
mutate(proportion = total / sum(total))
flightsPerDay
## # A tibble: 7 × 3
## DAY_OF_WEEK total proportion
## <int> <int> <dbl>
## 1 1 996633 0.148
## 2 2 937570 0.139
## 3 3 951327 0.141
## 4 4 998363 0.148
## 5 5 1003622 0.149
## 6 6 870955 0.129
## 7 7 984933 0.146
ggplot(data = flightsPerDay) + geom_point(aes(x = DAY_OF_WEEK, y = proportion), colour = "blue")
#### Answer 2.1 ## Alter this plot so that the points are blue and the plot is of the proportion of flights each day instead of total
Which month has the greatest proportion of late flights?
props = flightsNotCancelled %>%
group_by(MONTH) %>%
mutate(delayInd = ifelse(DEP_DELAY > 0,1,0)) %>%
#### Answer 3.1 Fill in the missing part
summarize(total = n(),success = sum(delayInd),prop = mean(delayInd)) %>%
filter(prop == max(prop))
props
## # A tibble: 1 × 4
## MONTH total success prop
## <int> <int> <dbl> <dbl>
## 1 7 585058 276357 0.472
The month with the greatest proportion of late flights is 7
Which departure and arrival airport combination is associated with the worst median departure delay? Make sure the airport combination has at least 1000 recorded flights.
You’re a bit on your own here. Use the ideas from class and from this homework to answer this question. Be sure to take a look at the nature of the delay feature and see how it represents early departures. You can think of delays in two ways: with early departures (that is, negative delay) or without early departures. Explore this question a bit and see if your answer changes depending on the definition.
# filter to have atleast 1000 records
airport_combo = flightsNotCancelled %>%
group_by(ORIGIN, DEST) %>%
summarise(nDep = n(), DEP_DELAY, ORIGIN, DEST, ORIGIN_CITY_NAME, DEST_CITY_NAME, .groups = "drop") %>%
filter(nDep >= 1000)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
###### WAY 01 : Entirely Removing Early Departures (removing negative dep delay rows)
# filter to without considering early departures
# approach : entirely removed the rows with negative delay
airport_combo_without_early_dep = airport_combo %>%
group_by(ORIGIN, DEST) %>%
filter(DEP_DELAY >= 0)
# applying median departure delay
median_dep_delay = airport_combo_without_early_dep %>%
group_by(ORIGIN, DEST) %>%
summarise(depDelayMedian = median(DEP_DELAY), ORIGIN, DEST, ORIGIN_CITY_NAME, DEST_CITY_NAME, .groups = "drop")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Sorting based on median to obtain median delay
worst_airport_combo = median_dep_delay %>%
arrange(desc(depDelayMedian)) %>%
slice(1)
worst_airport_combo
## # A tibble: 1 × 5
## ORIGIN DEST depDelayMedian ORIGIN_CITY_NAME DEST_CITY_NAME
## <chr> <chr> <dbl> <chr> <chr>
## 1 TXK DFW 42.5 Texarkana, AR Dallas/Fort Worth, TX
###### WAY 02 : Removing Early Departures by considering negative values as 0
# airport_combo_zeoring_neg_delay = airport_combo %>%
# group_by(ORIGIN, DEST) %>%
# mutate(recordDep = ifelse(DEP_DELAY > 0,DEP_DELAY,0)) %>%
# summarise(nDep = n(), delaySumDep = median(recordDep), ORIGIN_CITY_NAME, DEST_CITY_NAME, .groups = "drop")
#
# worst_airport_combo = airport_combo_zeoring_neg_delay %>%
# arrange(desc(delaySumDep)) %>%
# slice(1)
#
# worst_airport_combo
The worst departure/arrival airport combination in terms of median delay without including early departures is TXK - DFW
What are these two airports? Don’t answer using the airport number or 3 letter code. Instead, answer using the city/state or country of the airport.
Those two airports are : Texarkana, AR and Dallas/Fort Worth, TX