Assignment 7

Go to the shared posit.cloud workspace for this class and open the assign07 project. Open the assign07.qmd file and complete the exercises.

This is somewhat open-ended assignment.

The file domestic_flights_jan_2016.csv is nearly the same as the me_flights.csv file we work with in the temporal data chapter except it has two additional columns for destination city and state and it is for all domestic flights reported for on-time performance in the US for January 2016. You’ll find it helpful to recreate all of the calculated fields we create in the unit.
You are to write a report uploaded to RPubs that compares what you consider interesting metrics for a select group of carriers, airports, or states. You may not parrot the queries from the text or the practice questions. Your report must contain at least two questions that you ask about the flights data. Your answers to those questions must also contain a visualization of the data or a table along with a specific answer in the narrative.

The csv file contains 445,827 observations. You’ll want to subset the data to the area(s) you are looking at, then write it out to a csv file using write_csv(), and start your assignment by importing that csv instead. Do this in a separate script file that you don’t need to submit. In your narrative, describe your subset. I don’t need to see how you subsetted the data because it might cause performance issues when you render the document. Note: you will receive deductions for not using tidyverse syntax in this assignment. That includes the use of filter, mutate, and the up-to-date pipe operator |>.

The Grading Rubric is available at the end of this document.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)
library(gt)
flights <- read_csv("https://jsuleiman.com/datasets/domestic_flights_jan_2016.csv")

Rows: 445827 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): FlightDate, Carrier, TailNum, Origin, OriginCityName, OriginState,...
dbl (12): FlightNum, CRSDepTime, DepTime, WheelsOff, WheelsOn, CRSArrTime, A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(flights)

Rows: 445,827
Columns: 21
$ FlightDate        <chr> "1/6/2016", "1/7/2016", "1/8/2016", "1/9/2016", "1/1…
$ Carrier           <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
$ TailNum           <chr> "N4YBAA", "N434AA", "N541AA", "N489AA", "N439AA", "N…
$ FlightNum         <dbl> 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, …
$ Origin            <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF…
$ OriginCityName    <chr> "Dallas/Fort Worth, TX", "Dallas/Fort Worth, TX", "D…
$ OriginState       <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX"…
$ Dest              <chr> "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DT…
$ DestCityName      <chr> "Detroit, MI", "Detroit, MI", "Detroit, MI", "Detroi…
$ DestState         <chr> "MI", "MI", "MI", "MI", "MI", "MI", "MI", "MI", "MI"…
$ CRSDepTime        <dbl> 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100…
$ DepTime           <dbl> 1057, 1056, 1055, 1102, 1240, 1107, 1059, 1055, 1058…
$ WheelsOff         <dbl> 1112, 1110, 1116, 1115, 1300, 1118, 1113, 1107, 1110…
$ WheelsOn          <dbl> 1424, 1416, 1431, 1424, 1617, 1426, 1429, 1419, 1420…
$ CRSArrTime        <dbl> 1438, 1438, 1438, 1438, 1438, 1438, 1438, 1438, 1438…
$ ArrTime           <dbl> 1432, 1426, 1445, 1433, 1631, 1435, 1438, 1431, 1428…
$ Cancelled         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Diverted          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ CRSElapsedTime    <dbl> 158, 158, 158, 158, 158, 158, 158, 158, 158, 158, 15…
$ ActualElapsedTime <dbl> 155, 150, 170, 151, 171, 148, 159, 156, 150, 158, 14…
$ Distance          <dbl> 986, 986, 986, 986, 986, 986, 986, 986, 986, 986, 98…

flights <- flights |>
  mutate(FlightDate = as.Date(FlightDate, format = "%m/%d/%Y"))
flights |> 
  select(FlightDate) |> 
  head()

# A tibble: 6 × 1
  FlightDate
  <date>    
1 2016-01-06
2 2016-01-07
3 2016-01-08
4 2016-01-09
5 2016-01-10
6 2016-01-11

flights |> 
  filter(Cancelled == 0) |> 
  summarize(maxcrsdep = max(CRSDepTime), maxdep = max(DepTime))

# A tibble: 1 × 2
  maxcrsdep maxdep
      <dbl>  <dbl>
1      2359   2400

flights <- flights |> 
  mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime))) |> 
  mutate(new_CRSDepTime = as.POSIXct(new_CRSDepTime, format = "%Y-%m-%d %H%M"))
flights |> 
  select(new_CRSDepTime, FlightDate, CRSDepTime) |>
  head()

# A tibble: 6 × 3
  new_CRSDepTime      FlightDate CRSDepTime
  <dttm>              <date>          <dbl>
1 2016-01-06 11:00:00 2016-01-06       1100
2 2016-01-07 11:00:00 2016-01-07       1100
3 2016-01-08 11:00:00 2016-01-08       1100
4 2016-01-09 11:00:00 2016-01-09       1100
5 2016-01-10 11:00:00 2016-01-10       1100
6 2016-01-11 11:00:00 2016-01-11       1100

flights <- flights |> 
  mutate(
    new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)),
    new_CRSArrTime = as.POSIXct(new_CRSArrTime, format="%Y-%m-%d %H%M")
  )
flights <- flights |> 
  filter(Cancelled == 0) |> 
  mutate(
    new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)), 
    new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)),
    new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), 
    new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime))
    )  |> 
  mutate(
    new_DepTime = as.POSIXct(new_DepTime, format="%Y-%m-%d %H%M"),
    new_WheelsOff = as.POSIXct(new_WheelsOff, format="%Y-%m-%d %H%M"),
    new_WheelsOn = as.POSIXct(new_WheelsOn, format="%Y-%m-%d %H%M"),
    new_ArrTime = as.POSIXct(new_ArrTime, format="%Y-%m-%d %H%M")
  )
flights |> 
  select(new_CRSArrTime, new_DepTime, new_WheelsOff, new_WheelsOn, new_ArrTime) |> 
  head() |> 
  gt()

new_CRSArrTime	new_DepTime	new_WheelsOff	new_WheelsOn	new_ArrTime
2016-01-06 14:38:00	2016-01-06 10:57:00	2016-01-06 11:12:00	2016-01-06 14:24:00	2016-01-06 14:32:00
2016-01-07 14:38:00	2016-01-07 10:56:00	2016-01-07 11:10:00	2016-01-07 14:16:00	2016-01-07 14:26:00
2016-01-08 14:38:00	2016-01-08 10:55:00	2016-01-08 11:16:00	2016-01-08 14:31:00	2016-01-08 14:45:00
2016-01-09 14:38:00	2016-01-09 11:02:00	2016-01-09 11:15:00	2016-01-09 14:24:00	2016-01-09 14:33:00
2016-01-10 14:38:00	2016-01-10 12:40:00	2016-01-10 13:00:00	2016-01-10 16:17:00	2016-01-10 16:31:00
2016-01-11 14:38:00	2016-01-11 11:07:00	2016-01-11 11:18:00	2016-01-11 14:26:00	2016-01-11 14:35:00

flights <- flights |>
  mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")))
flights |> 
  select(CRSDepTime, DepTime, DepDelay) |> 
  head()

# A tibble: 6 × 3
  CRSDepTime DepTime DepDelay
       <dbl>   <dbl>    <int>
1       1100    1057       -3
2       1100    1056       -4
3       1100    1055       -5
4       1100    1102        2
5       1100    1240      100
6       1100    1107        7

flights <- flights |> 
  filter(Cancelled == 0) |> 
  mutate(
    DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay), 
    DepDel15 = ifelse(DepDelay >= 15, 1, 0)
    )
flights |> 
  select(DepDelay, DepDelayMinutes, DepDel15) |> 
  head()

# A tibble: 6 × 3
  DepDelay DepDelayMinutes DepDel15
     <int>           <dbl>    <dbl>
1       -3               0        0
2       -4               0        0
3       -5               0        0
4        2               2        0
5      100             100        1
6        7               7        0

flights <- flights |>
  mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")))
flights |> 
  select(CRSDepTime, DepTime, DepDelay) |> 
  head()

# A tibble: 6 × 3
  CRSDepTime DepTime DepDelay
       <dbl>   <dbl>    <int>
1       1100    1057       -3
2       1100    1056       -4
3       1100    1055       -5
4       1100    1102        2
5       1100    1240      100
6       1100    1107        7

flights <- flights |> 
  mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")),
         TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")),
         ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")),
         ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay), 
         ArrDel15 = ifelse(ArrDelay >= 15, 1, 0),
         FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime,
         AirTime = ActualElapsedTime - TaxiOut - TaxiIn,
         AirSpeed = Distance / (AirTime / 60)
)
flights |> 
  head() |> 
  gt()

FlightDate	Carrier	TailNum	FlightNum	Origin	OriginCityName	OriginState	Dest	DestCityName	DestState	CRSDepTime	DepTime	WheelsOff	WheelsOn	CRSArrTime	ArrTime	CRSElapsedTime	ActualElapsedTime	Distance	new_CRSDepTime	new_CRSArrTime	new_DepTime	new_WheelsOff	new_WheelsOn	new_ArrTime	DepDelay	DepDelayMinutes	DepDel15	TaxiOut	TaxiIn	ArrDelay	ArrDelayMinutes	ArrDel15	FlightTimeBuffer	AirTime	AirSpeed
2016-01-06	AA	N4YBAA	43	DFW	Dallas/Fort Worth, TX	TX	DTW	Detroit, MI	MI	1100	1057	1112	1424	1438	1432	158	155	986	2016-01-06 11:00:00	2016-01-06 14:38:00	2016-01-06 10:57:00	2016-01-06 11:12:00	2016-01-06 14:24:00	2016-01-06 14:32:00	-3	0	0	15	8	-6	0	0	3	132	448.1818
2016-01-07	AA	N434AA	43	DFW	Dallas/Fort Worth, TX	TX	DTW	Detroit, MI	MI	1100	1056	1110	1416	1438	1426	158	150	986	2016-01-07 11:00:00	2016-01-07 14:38:00	2016-01-07 10:56:00	2016-01-07 11:10:00	2016-01-07 14:16:00	2016-01-07 14:26:00	-4	0	0	14	10	-12	0	0	8	126	469.5238
2016-01-08	AA	N541AA	43	DFW	Dallas/Fort Worth, TX	TX	DTW	Detroit, MI	MI	1100	1055	1116	1431	1438	1445	158	170	986	2016-01-08 11:00:00	2016-01-08 14:38:00	2016-01-08 10:55:00	2016-01-08 11:16:00	2016-01-08 14:31:00	2016-01-08 14:45:00	-5	0	0	21	14	7	7	0	-12	135	438.2222
2016-01-09	AA	N489AA	43	DFW	Dallas/Fort Worth, TX	TX	DTW	Detroit, MI	MI	1100	1102	1115	1424	1438	1433	158	151	986	2016-01-09 11:00:00	2016-01-09 14:38:00	2016-01-09 11:02:00	2016-01-09 11:15:00	2016-01-09 14:24:00	2016-01-09 14:33:00	2	2	0	13	9	-5	0	0	7	129	458.6047
2016-01-10	AA	N439AA	43	DFW	Dallas/Fort Worth, TX	TX	DTW	Detroit, MI	MI	1100	1240	1300	1617	1438	1631	158	171	986	2016-01-10 11:00:00	2016-01-10 14:38:00	2016-01-10 12:40:00	2016-01-10 13:00:00	2016-01-10 16:17:00	2016-01-10 16:31:00	100	100	1	20	14	113	113	1	-13	137	431.8248
2016-01-11	AA	N468AA	43	DFW	Dallas/Fort Worth, TX	TX	DTW	Detroit, MI	MI	1100	1107	1118	1426	1438	1435	158	148	986	2016-01-11 11:00:00	2016-01-11 14:38:00	2016-01-11 11:07:00	2016-01-11 11:18:00	2016-01-11 14:26:00	2016-01-11 14:35:00	7	7	0	11	9	-3	0	0	10	128	462.1875

flights |> 
  group_by(day = wday(FlightDate, label=TRUE)) |> 
  mutate(delayed = ifelse(DepDelay > 0, 1, 0)) |> 
  summarize(perc_delay = sum(delayed) / n()) |> 
  ggplot(aes(x = day, y = perc_delay)) + 
  geom_col() +
  theme_minimal()

The first thing I Wanted to see in my data was the number of operations that had missing patterns. I see that tail number had the most overall missing values. Overall I observed that the lowest day for delayed flights is Wednesday which leaves the question as to why this is. Another question I have about the data is why Monday is the highest in that category

This is your work area. Add as many code cells as you need.

1 Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document
Render your document to html and publish it to RPubs.
Submit the link to your Rpubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

2 Grading Rubric

Item (percent overall)	100% - flawless	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Question 1 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 1 visualization or table. (15%)	Visually pleasant and relevant to the question.
Question 2 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 2 visualization or table. (15%)	Visually pleasant and relevant to the question.
Data was subsetted separately from the assignment. (10%)	You included the description of your subsetted data in your narrative.	You subsetted the data but didn’t include the description in the narrative.	NA	You didn’t subset the data.
Messages and/or errors suppressed from rendered document and all code is shown. (8%)
Submitted properly to Brightspace (8%)		NA	NA	You must submit according to instructions to receive any credit for this portion.