Trelliscope_Assignment4

# Load necessary packages
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(trelliscopejs)
library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(purrr)

# Investigate initial correlations
cor(nycflights[,c(4:7,13:14)])

##              dep_time   dep_delay   arr_time   arr_delay    air_time
## dep_time   1.00000000  0.25929359 0.66391209  0.23196328 -0.02110843
## dep_delay  0.25929359  1.00000000 0.04225206  0.91606217 -0.01342772
## arr_time   0.66391209  0.04225206 1.00000000  0.03443377  0.05413819
## arr_delay  0.23196328  0.91606217 0.03443377  1.00000000 -0.02824045
## air_time  -0.02110843 -0.01342772 0.05413819 -0.02824045  1.00000000
## distance  -0.02056249 -0.01269835 0.04737048 -0.05445608  0.99072420
##              distance
## dep_time  -0.02056249
## dep_delay -0.01269835
## arr_time   0.04737048
## arr_delay -0.05445608
## air_time   0.99072420
## distance   1.00000000

# Make a date column from the separate year, month, and day columns using make_date in lubridate package
nycflights <- nycflights %>%
  mutate(date = make_date(month = month, day = day, year = year))

# This will be scatterplots faceted by airline carrier and hour of the day (on a 24 hour clock)

# I will find the cutoff for outliers by adding 1.5*IQR of airline departure delays to theh 75th quantile for each group of hour of day and airline carrier to determine which points are not shown

# Then, I filter to only include points that have departure delays that are less than this upper outlier bound 

# A cognostic to show the number of outliers removed for each combination of airline carrier and hour will be shown on the panels

# Another cognostic to a wikipedia page detailing which airline carrier the abbreviation from the data corresponds to, with additional information

# The scatterplots show the date on the x-axis, with departure delay in minutes on the y-axis. Each plot only shows values for a certain hour on a 24 hour clock for a certain airline carrier. The points are also colored by the airport of origin because this gives additional information without crowding the plot

# Axes are set to free to be able to show the respective axes for each plot better without extra space


nycflights %>%group_by(hour, carrier)%>%
  mutate(cutoff = quantile(dep_delay, 0.75)+1.5*(IQR(dep_delay)))%>%
    mutate(num_outliers = cog(val = sum(dep_delay > cutoff), desc = "Departure Delay is greater than 1.5*IQR of group", default_label = TRUE))%>%
  filter(dep_delay < cutoff) %>%
  mutate(carrier_wiki_page = cog(val = paste0("https://en.wikipedia.org/wiki/", carrier), 
            desc = "Airline Carrier Info", default_label = TRUE))%>%
  ggplot(aes(date, dep_delay, color = origin))+
  geom_point(alpha = 0.5)+
  facet_trelliscope(~carrier + hour, scales = "free",
                    name = "Departure Delay in Minutes by Date, faceted by hour of the day and airline carrier", desc = "Random sample of 32,735 flights that departed from NYC in 2013 (Negative Departures = Early)", 
                    nrow = 1, ncol = 2,
                    path = ".",
                    self_contained = TRUE)

Description

The dataset I am using contains 32,735 rows initially, with each row corresponding to a flight that departed from a New York airport in 2013. The original dataset contains 16 variables, among these are variables relating to the year, month, day, and hour of the flight. Also, there is information on departure and arrival delays in minutes. It is important to realize that there are negative times in the departure and arrival delay columns, but these are okay and reveal flights that arrived or departed earlier than expected. There is a two letter carrier abbreviation for the airline carriers present, of which there are 16. We can see how much time was spent in the air on the flight and the total distance flown, as well as the origin and destination of each flight.

I chose to plot in a scatterplot the departure delays in minutes against the dates of the flights, faceted by hour of the day (on a 24 hour clock, so hour can range from 0-24) and by airline carrier. I also colored the points by airport of origin since there are only 3 possible NY airports of departure (JFK, LGA, or EWR), and this information could add some value. As can be seen in the code, I calculated 1.5*IQR of the departure times for each group of hour of departure and airline carrier. One of the challenges I faced when creating these plots were that there are some extra large departure delays, so I decided to focus in on those points that would not be considered upper outliers. By filtering where the departure time is less than the upper outlier cutoff, I was able to remove extra large points in each plot. Setting the axes to “free” also helped alleviate blank space in the plots. I am trying to investigate possible time trends by month and day/night, as well as carrier. It could be useful to see which airline carrier has the longest departure delays for certain hours and months/days, for example, to get an idea of which carriers you may not want to fly with if you have a specific day and time you are looking for a flight. It was a challenge to consider how to combine all these dimensions, but I created multiple plots until the final plot captured the relevant information I wanted to explore. I think a scatterplot is definitely the best way to display all these variables. With the plots, you can even see which airport in NY had the longest departure delay for a certain hour and carrier. You could also look at the average departure delays for all 16 carriers for a certain hour of the day. So, if you know you would like to get a flight at 6 AM for American Airlines out of NY, you can see what the average delay was in 2013. By plotting against the month and day, you can also easily see if a carrier had higher or lower departure delays at an hour of the day in the spring, summer, winter, or fall in 2013. Users can filter by distance of flights as well, to get a better understanding of departure delays for different lengths of flights. Of course, this data is dated, and is more than likely not relative to current times, but these graphs can still be explored to see past trends.

There is a lot of additional information beyond departure delays that can be considered as part of the cognostics that are calculated. There are cognostics like distance_mean and and arr_delay_mean, among many more, that are automatically calculated when the plot is created. This means that you could see the average distance traveled among flights for a certain hour of the day and carrier over the year 2013. You could also easily find the average arrival delay at midnight, for example, using arr_delay_mean, for each airline carrier and compare this to the average departure delay at midnight for each carrier. Most of these diagnostics will be helpful to explore trends over the hours of the day for each airline, since we faceted by hour. Looking at the individual plots is necessary to discover trends about higher or lower departure delays over the specific season/time of year. I was challenged to provide information about what the carriers stand for, since only abbreviations are present in the data. I solved this by adding a cognostic measure as an embedded Wikepedia page for information about each airline carrier. This can provide the user with quick access to what airline the plot is displaying information for. I also created a diagnostic to show how many points in the original dataset were not plotted for each hour and carrier due to being a high outlier, which is another indiciator of which airlines tend to have higher departure delays.

or go to web https://rpubs.com/Olivias3/1172110

Trelliscope_Assignment4_Schultheis

Olivia Schultheis

2024-04-10

Description