Learning Checks 1-25:

1: These data frames differ in that flights contains tens of thousands more rows than envoy_flights. In addition, flights includes multiple airlines and many destinations, while envoy_flights has only a single carrier and much less variation.
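One way to check this comparison is to count rows, carriers, and destinations in each data frame. This is a minimal sketch, assuming envoy_flights shares the carrier and dest columns of flights; size_summary is a helper name made up for this example.

```r
library(dplyr)
library(nycflights23)   # provides flights
library(moderndive)     # provides envoy_flights

# Hypothetical helper: one-row summary of a flights-style data frame
size_summary <- function(df) {
  df |>
    summarize(
      rows         = n(),
      carriers     = n_distinct(carrier),
      destinations = n_distinct(dest)
    )
}

flights_summary <- size_summary(flights)
envoy_summary   <- size_summary(envoy_flights)

# Stack the two summaries, labelled by data frame name
bind_rows(flights = flights_summary, envoy_flights = envoy_summary,
          .id = "data_frame")
```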

2: As arrival delays pile up and planes have trouble arriving on time, departure delays inherently increase because those same planes are needed for later trips. The same plane might have 2-3 trips scheduled in a row, so an arrival delay on the first leg can cause departure delays for the remaining scheduled flights.

3: One variable I’d expect to have a negative relationship with departure delays is visibility, since high visibility should result in fewer departure delays. Another is pressure, since low-pressure days typically indicate storms, which would increase departure delays.

4: There is a cluster near (0, 0) because most airlines schedule very tightly and are often able to depart and arrive exactly on schedule. For Envoy Air flights, (0, 0) represents a smooth flight with no major issues under normal operating conditions. In addition, Envoy Air is a regional carrier, meaning it flies shorter routes, which could also reduce late arrivals and departures.

5: One feature that really sticks out is the handful of extreme outliers. They still follow the general trend line but are far larger than most of the observations. I wonder what caused those delays and how they were dealt with. In addition, I noticed a decent chunk of observations that arrived around 25 minutes late but still departed on time.
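The guesses about visibility and pressure could be checked by joining flights to the hourly weather data and computing correlations with dep_delay. This is only a sketch: it assumes the two data frames join on origin and time_hour (the keys used in the nycflights13 family of packages), and a correlation measures linear association only.

```r
library(dplyr)
library(nycflights23)   # provides flights and weather

# Attach the hourly weather at the origin airport to each flight
delays_weather <- flights |>
  inner_join(weather, by = c("origin", "time_hour")) |>
  select(dep_delay, visib, pressure)

# Correlation of departure delay with each weather variable,
# using only the rows where both values are present
cor_visib <- cor(delays_weather$dep_delay, delays_weather$visib,
                 use = "pairwise.complete.obs")
cor_pressure <- cor(delays_weather$dep_delay, delays_weather$pressure,
                    use = "pairwise.complete.obs")

c(visib = cor_visib, pressure = cor_pressure)
```

A negative value would be consistent with the expectation that delays shrink as visibility or pressure rises.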

6:

library(tidyverse)      # loads ggplot2
library(moderndive)
library(nycflights23)
ggplot(data = envoy_flights, mapping = aes(x = month, y = arr_delay)) + 
  geom_point()
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

7: Setting the alpha argument is useful because it controls the transparency of the points, which helps when plotting many observations. With overplotted points made semi-transparent, you can learn a lot more about density and spot clusters that a regular scatterplot would hide.
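As an illustration, here is a minimal sketch of a scatterplot with semi-transparent points. The value alpha = 0.2 is an assumption chosen for the example, not a recommended setting.

```r
library(ggplot2)
library(moderndive)   # provides envoy_flights

# Overlapping points stack up and appear darker, revealing dense regions
alpha_plot <- ggplot(data = envoy_flights,
                     mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)

alpha_plot
```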

8: Looking at Figure 2.4, it becomes apparent that the majority of departure and arrival delays lie in the range [-25, 0], indicating that flights more often than not leave and arrive on time, if not early. The view of that region has definitely changed: its density is much clearer in the new plot than in the old one.

9: The number of rows is the biggest difference: the first data set covers all of 2023, while the second covers only the weather at the beginning of January.

10: time_hour uniquely identifies the time because it is a date-time variable that records the date together with the hour, while hour is just an integer that repeats every day.
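A quick way to see the difference between the two variables is to count their distinct values. This sketch assumes early_january_2023_weather contains both the hour and time_hour columns.

```r
library(dplyr)
library(moderndive)   # provides early_january_2023_weather

# hour can take at most 24 values, so it repeats across days;
# time_hour should have about one distinct value per measurement
hour_check <- early_january_2023_weather |>
  summarize(
    rows               = n(),
    distinct_hour      = n_distinct(hour),
    distinct_time_hour = n_distinct(time_hour)
  )

hour_check
```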

11: Line graphs are most useful when the horizontal axis has a natural order. When there is no order, the connecting lines could lead the audience to read in false trends.

12: Line graphs are frequently used with time because observing a variable over time is a common and effective way to discover patterns. As a result, line graphs typically show trends over time and highlight features such as shocks or seasonality.

13:

ggplot(data = early_january_2023_weather, 
       mapping = aes(x = time_hour, y = visib)) +
  geom_line()

14: Thirty bins show the data at a more granular level and highlight the irregularities. Twenty bins give a much smoother picture in which the overall shape is easier to see.
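The two versions can be drawn side by side for comparison. This is a sketch that assumes the histogram is of wind_speed from the weather data frame; the white bar outline is a cosmetic choice.

```r
library(ggplot2)
library(nycflights23)   # provides weather

# Same data, two bin counts: only the bins argument changes
hist_30 <- ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(bins = 30, color = "white")

hist_20 <- ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(bins = 20, color = "white")

hist_30
hist_20
```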

15: I would say the distribution of wind speeds is definitely skewed right, as lower wind speeds are much more common than high ones.

16: I would guess that the center of this distribution is around 9-10. The distribution is skewed right, but there are still a ton of observations in the 10-12 range, so the center should be close to there.

17: Some values are spread out, but the majority of observations are clustered near the center. This is because wind speeds tend to stay fairly consistent, and it is hard to find a factor that would dramatically change them.
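The visual guesses about skew, center, and spread can be compared against summary statistics. A minimal sketch, assuming wind_speed from the weather data frame; missing values are dropped before each calculation.

```r
library(dplyr)
library(nycflights23)   # provides weather

# A mean above the median is one common sign of right skew;
# sd and IQR are two ways to describe the spread
wind_summary <- weather |>
  summarize(
    mean   = mean(wind_speed, na.rm = TRUE),
    median = median(wind_speed, na.rm = TRUE),
    sd     = sd(wind_speed, na.rm = TRUE),
    iqr    = IQR(wind_speed, na.rm = TRUE)
  )

wind_summary
```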

18: Faceted plots help you see relationships that may show up gradually across groups. In addition, they let you see how the strength of a relationship changes by facet. They help us compare the same relationship under different conditions, which gives a clearer picture of how two variables relate.

19: In this chart, each facet represents a month, while the x-axis represents wind speed.

20: Faceted plots do not work well when the data are split into too many small facets, when the groups are highly unbalanced, or when a continuous variable is used to define the facets. An example would be faceting by day over the course of a year. The abundance of facets would produce a messy plot that is hard to use.
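For reference, a faceted histogram of this kind can be sketched as follows. The bin count and the four-row layout are assumptions for illustration.

```r
library(ggplot2)
library(nycflights23)   # provides weather

# One histogram of wind speed per month, sharing the same axes
facet_plot <- ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(bins = 20, color = "white") +
  facet_wrap(~ month, nrow = 4)

facet_plot
```

Swapping month for a variable with hundreds of distinct values would create one panel per value, which is the failure case described above.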

21: wind_speed shows a good bit of variability when looking at the monthly data. Over the year, the seasons seem to cause the most variation, as winds are higher at the beginning of the year than in any other part. The extreme observations can often be explained by natural disasters, which are relatively rare events.

22: The dots at the top of the plot for January correspond to the wind speed outliers in January. Thus, they represent wind speeds of 23+ mph. January is typically the peak of winter, and there tend to be a lot of snowstorms. As a result, a number of storms may have caused those outliers.

23: February seems to have the highest variation in wind speeds when looking at the box-and-whisker plot. The reason is that it has the largest box, which stretches from 5 mph to 15 mph. This is the biggest difference between the 1st and 3rd quartiles (the interquartile range) of any month. In addition, it has extremely long whiskers and the largest outliers of the year, indicating plenty of days with high wind speeds, which adds to the variability.
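The box heights read off the plot can be checked by computing the quartiles for each month. A sketch, assuming the box plot was built from wind_speed grouped by month in the weather data frame.

```r
library(dplyr)
library(nycflights23)   # provides weather

# Quartiles and interquartile range per month, widest box first
monthly_spread <- weather |>
  group_by(month) |>
  summarize(
    q1  = quantile(wind_speed, 0.25, na.rm = TRUE, names = FALSE),
    q3  = quantile(wind_speed, 0.75, na.rm = TRUE, names = FALSE),
    iqr = q3 - q1
  ) |>
  arrange(desc(iqr))

monthly_spread
```

The first row shows the month with the largest box, which can be compared against the visual impression.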

24: A box plot split by pressure would not work because pressure is a continuous variable, which could mean hundreds or thousands of potential categories. As a result, the plot would be cluttered, and it would be hard to derive patterns from it.
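To get a rough sense of how many boxes such a plot would need, the distinct values of each candidate grouping variable can be counted. A sketch, assuming the weather data frame; n_distinct counts NA as a value if any are present.

```r
library(dplyr)
library(nycflights23)   # provides weather

# Each distinct value of the grouping variable becomes its own box
distinct_counts <- weather |>
  summarize(
    months          = n_distinct(month),
    pressure_values = n_distinct(pressure)
  )

distinct_counts
```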

25: It is easier to identify an outlier in a box plot because the plot marks outliers explicitly as individual dots. In a faceted histogram, an outlier is drawn the same way as every other observation and might be hard to identify.