Question 2

Load the CSV file FlightsWithAirlines.csv containing composite flight and airline information objects into a data frame called df.flights. Do not load the text (strings) attributes (columns) as factors, so use stringsAsFactors = FALSE as a parameter in your function that loads the data; load them as text. Do not add any path as part of the file name or in your code. Load the file from its URL – do not load from your computer. To get the URL for the CSV, right-click on the link and then select “Copy Link Address” or a similar menu option for your browser; do not click on the link as that will cause the browser to attempt to download and display the file. Do not echo the code into your knitted document.

df.flights <- read.csv(url("https://s3.us-east-2.amazonaws.com/artificium.us/datasets/FlightsWithAirlines.csv"), stringsAsFactors = FALSE)

Note: with R 4.0.0 and later, read.csv has been modified so that stringsAsFactors = FALSE is a default value, making string columns load automatically as text and not factors, but it is still good practice to include it for code clarity and compatibility, particularly if code is run with older versions of R!

Question 3

Produce the following markdown in your document and embed code within the markdown so it is dynamic. Do not echo the code into the knitted document that produces the markdown; we just want the text (this is the essence of “literate programming”). You may add any additional code chunks to make calculations, but do not echo them in your knitted document. The highlighted numbers are calculated (you cannot “hard code” them as the data might change the next time you knit).You do not need to highlight them in your markdown, although you may.

# Calculate basic dataset properties
num_rows <- nrow(df.flights)  # Get total number of rows
num_cols <- ncol(df.flights)  # Get total number of columns

# Calculate statistics on flight delays
avg_delay <-round(mean(df.flights$dep_delay, na.rm=TRUE))  # Calculate mean delay, rounded to nearest min (to a whole number)
quantity_delayed_flights <- sum(df.flights$dep_delay > 60, na.rm=TRUE)  # Count flights with delays longer than 60 minutes

The dataset contains 18 observations (rows) and 14 columns. Each row is information about a flight, including departure and departure delay information. The average mean departure delay is 70 minutes. A total of 3 flights experienced a delay of more than an hour.

Question 4

Add a new column ‘isNight’ to the data frame with a value of T or F, depending whether the actual departure time was at night. Set to “T” if the time was after 7pm and before 5am. The time in the data file is in 24-hour form. Hint: Look up how to use the ifelse function. Use the dep_hr, dep_min, and dep_delay columns. You may simplify the question to only use the dep_hr column, but you will only get 50% of the credit

# Step 1: Convert scheduled departure times to total minutes since midnight so factors have same units for calculation
total_minutes_scheduled <- df.flights$dep_hr * 60 + df.flights$dep_min  # Multiply hours by 60 to get minutes then add departure minutes

# Step 2: Add delay to total scheduled minutes since midnight to get actual departure time in minutes
total_minutes_actual <- total_minutes_scheduled + df.flights$dep_delay  # Scheduled time plus delay time equals actual time of departure in minutes

# Step 3: Ensure time is expressed as minutes within a 24-hour day (0-1439). Use modulo 1440 to manage times past midnight, for example, 1455 minutes (12:15 AM the next day) becomes 0:15 for determining day/night flight, and the day when actual time > 1439, just identifying day or night.
total_minutes_actual <- total_minutes_actual %% 1440

# Step 4: Extract hour component from total minutes for night identification.
actual_hours <- total_minutes_actual %/% 60  # Use integer division to ensure whole number

# Step 5: Create isNight column in df.flights based on the actual hours calculation in step 4. Night is defined as 1900-0500 (7pm-5am)
df.flights$isNight <- ifelse(actual_hours >=19 | actual_hours < 5, TRUE, FALSE)  # this vectorized conditional function returns TRUE if the hour is greater than or equal to 19, and less than 5, and returns FALSE if conditions are not met

And finally, print the dataframe to ensure the new column is there and is correct. You may just use the print() function or experiment with the kable() function.

#Printing the dataframe with kable() yields a more visually appealing table with proper column headers
library(knitr)  # Load the knitr library to use kable()
kable(df.flights, caption = "Flight Data with Night Status")
Flight Data with Night Status
year month day dep_hr dep_min dep_delay carrier airline country flight equip tailnum origin dest isNight
2022 9 18 2 38 0 UA United USA 1545 B737-8 N14228 EWR IAH TRUE
2022 2 14 12 25 13 AA American Airlines USA 441 B777 N24211 MIA GRU FALSE
2022 1 7 13 56 9 AA American Airlines USA 1141 A321 N619AA JFK MIA FALSE
2022 6 7 22 56 7 B6 JetBlue USA 725 A321 N804JB JFK BQN TRUE
2022 1 30 24 2 14 DL Delta Airlines USA 461 B757-2 N668DN LGA ATL TRUE
2022 5 2 11 2 49 UA United USA 1696 B737-MAX N39463 EWR ORD FALSE
2022 5 5 5 50 13 B6 JetBlue USA 507 A321 N516JB EWR FLL FALSE
2022 10 24 20 23 50 NK Spirit Airways USA 5708 A321 N829AS BOS PBI TRUE
2022 6 16 5 31 30 LH Lufthansa Germany 411 B747-4 N593JB CLT MUC FALSE
2022 8 18 2 38 20 UA United USA 1545 B737-8 N14228 EWR IAH TRUE
2022 3 14 12 25 91 AA American Airlines USA 441 B777 N24211 MIA GRU FALSE
2022 2 7 13 56 0 AA American Airlines USA 1141 A321 N619AA JFK MIA FALSE
2022 5 7 22 56 2 B6 JetBlue USA 725 A321 N804JB JFK BQN TRUE
2023 2 30 24 2 0 DL Delta Airlines USA 461 B757-2 N668DN LGA ATL TRUE
2023 6 2 11 2 68 UA United USA 1696 B737-MAX N39463 EWR ORD FALSE
2023 6 5 5 50 850 B6 JetBlue USA 507 A321 N516JB EWR FLL TRUE
2023 11 24 20 23 20 NK Spirit Airways USA 5708 A321 N829AS BOS PBI TRUE
2024 7 16 5 31 18 LH Lufthansa Germany 411 B747-4 N593JB CLT MUC FALSE