library(nycflights23)
library(ggplot2)
library(moderndive)
library(tibble)
2.1
glimpse(flights)
## Rows: 435,352
## Columns: 19
## $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 1, 18, 31, 33, 36, 503, 520, 524, 537, 547, 549, 551, 5…
## $ sched_dep_time <int> 2038, 2300, 2344, 2140, 2048, 500, 510, 530, 520, 545, …
## $ dep_delay <dbl> 203, 78, 47, 173, 228, 3, 10, -6, 17, 2, -10, -9, -7, -…
## $ arr_time <int> 328, 228, 500, 238, 223, 808, 948, 645, 926, 845, 905, …
## $ sched_arr_time <int> 3, 135, 426, 2352, 2252, 815, 949, 710, 818, 852, 901, …
## $ arr_delay <dbl> 205, 53, 34, 166, 211, -7, -1, -25, 68, -7, 4, -13, -14…
## $ carrier <chr> "UA", "DL", "B6", "B6", "UA", "AA", "B6", "AA", "UA", "…
## $ flight <int> 628, 393, 371, 1053, 219, 499, 996, 981, 206, 225, 800,…
## $ tailnum <chr> "N25201", "N830DN", "N807JB", "N265JB", "N17730", "N925…
## $ origin <chr> "EWR", "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "EWR",…
## $ dest <chr> "SMF", "ATL", "BQN", "CHS", "DTW", "MIA", "BQN", "ORD",…
## $ air_time <dbl> 367, 108, 190, 108, 80, 154, 192, 119, 258, 157, 164, 1…
## $ distance <dbl> 2500, 760, 1576, 636, 488, 1085, 1576, 719, 1400, 1065,…
## $ hour <dbl> 20, 23, 23, 21, 20, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 6,…
## $ minute <dbl> 38, 0, 44, 40, 48, 0, 10, 30, 20, 45, 59, 0, 59, 0, 0, …
## $ time_hour <dttm> 2023-01-01 20:00:00, 2023-01-01 23:00:00, 2023-01-01 2…
glimpse(envoy_flights)
## Rows: 357
## Columns: 19
## $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 4, 5, 6, 7, 8, 9, 10, …
## $ dep_time <int> 1405, 1405, 1630, 1505, 1405, 1405, 1456, 1403, 1408, 1…
## $ sched_dep_time <int> 1410, 1410, 1410, 1410, 1410, 1410, 1410, 1410, 1410, 1…
## $ dep_delay <dbl> -5, -5, 140, 55, -5, -5, 46, -7, -2, -5, -2, -3, -4, -2…
## $ arr_time <int> 1541, 1543, 1809, 1647, 1535, 1533, 1609, 1545, 1528, 1…
## $ sched_arr_time <int> 1555, 1555, 1555, 1555, 1555, 1555, 1555, 1555, 1555, 1…
## $ arr_delay <dbl> -14, -12, 134, 52, -20, -22, 14, -10, -27, 12, -8, 0, -…
## $ carrier <chr> "MQ", "MQ", "MQ", "MQ", "MQ", "MQ", "MQ", "MQ", "MQ", "…
## $ flight <int> 1125, 1125, 1125, 1125, 1125, 1125, 1125, 1125, 1125, 9…
## $ tailnum <chr> "N222NS", "N283NN", "N220NN", "N235NN", "N235NN", "N240…
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR",…
## $ dest <chr> "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD",…
## $ air_time <dbl> 129, 133, 121, 135, 109, 113, 109, 119, 117, 122, 121, …
## $ distance <dbl> 719, 719, 719, 719, 719, 719, 719, 719, 719, 719, 719, …
## $ hour <dbl> 14, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13, 16, 13, 13,…
## $ minute <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 39, 39, 58, 39, 39,…
## $ time_hour <dttm> 2023-01-01 14:00:00, 2023-01-02 14:00:00, 2023-01-03 1…
2.2 If a flight departs late, it is much more likely to arrive late. Actual flight times tend to stray very little from predicted times, so any arrival delay is likely caused by a departure delay.
2.3 I would expect visib to have a negative relationship with dep_delay. I expect that as visibility decreases, dep_delay tends to increase, and vice versa.
2.4 There is a cluster near (0, 0) because flights are scheduled to be on time, and a flight with no departure delay is less likely to have an arrival delay.
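2.5 dep_delay and arr_delay are strongly and positively related, and the points fall close to a line of best fit that could be plotted. The two delays usually stay within a modest margin of each other. In the glimpse(envoy_flights) output above, arr_delay is often a little smaller than dep_delay (for example, 140 versus 134), which suggests flights can make up some time in the air, though there are also examples of the opposite relationship.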
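As a rough check on the relationship described in 2.5, the sketch below computes the correlation between the two delay variables and the share of flights that arrive with a smaller delay than they departed with. It assumes envoy_flights is available from the packages loaded at the top; the names delay_cor and made_up_time are only illustrative.

```r
library(nycflights23)
library(moderndive)

# Correlation between departure and arrival delay, using only complete rows
delay_cor <- cor(envoy_flights$dep_delay, envoy_flights$arr_delay,
                 use = "complete.obs")
delay_cor

# Share of flights whose arrival delay is smaller than their departure delay
# (rows with a missing delay are dropped by na.rm = TRUE)
made_up_time <- mean(envoy_flights$arr_delay < envoy_flights$dep_delay,
                     na.rm = TRUE)
made_up_time
```

A correlation close to 1 would support the "line of best fit" description, and the second number shows how common it is for a flight to make up time.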
2.6
ggplot(data = envoy_flights, mapping = aes(x = air_time, y = arr_delay)) +
  geom_point()
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
2.7 The alpha argument sets the transparency of the points. Where many points overlap, the region appears darker, so we can see how many data points sit at each location. It is valuable when certain values are heavily populated and overplotting would otherwise hide that.
2.8 The plot with alpha applied gives a better idea of how frequently flights are delayed and how long delays tend to last. The points are densest around (0, 0), meaning that delays are usually nonexistent or short.
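For reference, a minimal sketch of a scatterplot with transparency applied, which is the kind of plot 2.7 and 2.8 describe. The value alpha = 0.2 is an illustrative choice, and p_alpha is just a name used here.

```r
library(ggplot2)
library(nycflights23)
library(moderndive)

# Semi-transparent points: overlapping points stack up and look darker
p_alpha <- ggplot(data = envoy_flights,
                  mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)
p_alpha
```

Lower alpha values make each point fainter, so only heavily overlapped regions stand out.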
2.9
glimpse(weather)
## Rows: 26,207
## Columns: 15
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JF…
## $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ temp <dbl> 48.0, 48.2, 49.0, 49.0, 49.0, 48.0, 46.4, 46.0, 48.0, 47.0,…
## $ dewp <dbl> 48.0, 48.2, 49.0, 49.0, 49.0, 48.0, 46.4, 46.0, 48.0, 47.0,…
## $ humid <dbl> 100.00, 100.00, 100.00, 100.00, 100.00, 100.00, 100.00, 100…
## $ wind_dir <dbl> 0, 190, 190, 250, 170, 0, 250, 230, 260, 250, 240, 260, 260…
## $ wind_speed <dbl> 0.00000, 4.60312, 5.75390, 5.75390, 8.05546, 0.00000, 9.206…
## $ wind_gust <dbl> 0.000000, 5.297178, 6.621473, 6.621473, 9.270062, 0.000000,…
## $ precip <dbl> 1e-02, 1e-02, 1e-04, 2e-02, 1e-04, 1e-04, 0e+00, 0e+00, 0e+…
## $ pressure <dbl> 1010.2, 1009.2, 1009.0, 1008.0, 1007.8, 1007.6, 1007.3, 100…
## $ visib <dbl> 0.25, 2.50, 0.25, 4.00, 0.75, 0.75, 0.24, 0.50, 8.00, 5.00,…
## $ time_hour <dttm> 2023-01-01 00:00:00, 2023-01-01 01:00:00, 2023-01-01 02:00…
glimpse(early_january_2023_weather)
## Rows: 360
## Columns: 15
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW…
## $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ temp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ dewp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ humid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ wind_dir <dbl> 0, 210, 0, 210, 250, 130, 230, 210, 210, 220, 200, 220, 240…
## $ wind_speed <dbl> 0.00000, 2.30156, 0.00000, 5.75390, 8.05546, 4.60312, 8.055…
## $ wind_gust <dbl> 0.000000, 2.648589, 0.000000, 6.621473, 9.270062, 5.297178,…
## $ precip <dbl> NA, NA, 0.01, 0.01, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pressure <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ visib <dbl> 0.5, 2.5, 2.5, 3.0, 5.0, 4.0, 3.0, 5.0, 4.0, 5.0, 5.0, 7.0,…
## $ time_hour <dttm> 2023-01-01 03:00:00, 2023-01-01 04:00:00, 2023-01-01 05:00…
The weather data frame has 26,207 rows, while early_january_2023_weather has only 360. weather also covers every origin airport in the dataset, whereas early_january_2023_weather includes only Newark (EWR).
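The comparison can also be made directly rather than by reading the glimpse() output. This is a small sketch assuming both data frames come from the packages loaded at the top.

```r
library(nycflights23)
library(moderndive)

# Row counts of the two data frames
n_all   <- nrow(weather)
n_early <- nrow(early_january_2023_weather)
c(weather = n_all, early_january = n_early)

# Which origin airports appear in each data frame
unique(weather$origin)
unique(early_january_2023_weather$origin)

# Time span covered by the smaller data frame
range(early_january_2023_weather$time_hour)
```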
2.10 time_hour is a full date-time stamp, so it places each observation at a unique point in the year. The hour variable only records the hour of the day, so the same value recurs every day and cannot order observations across days.
2.11 Without a clear ordering of the horizontal axis, the plot could have multiple y values for the same x value, and the line connecting them would be nonsensical.
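A quick way to see the point made in 2.10 and 2.11 is to count how often each variable repeats. This is only a sketch; n_hours and n_time_hours are illustrative names.

```r
library(nycflights23)
library(moderndive)

# hour is the hour of the day, so it can take at most 24 distinct values
n_hours <- length(unique(early_january_2023_weather$hour))
n_hours

# time_hour is a date-time stamp, so it takes many more distinct values
n_time_hours <- length(unique(early_january_2023_weather$time_hour))
n_time_hours

# Number of rows sharing each hour value: these would all sit at the same x
table(early_january_2023_weather$hour)
```

With hour on the x-axis, every x value would carry many y values, which is exactly the situation 2.11 warns about.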
2.12 A linegraph clearly visualizes how the data changes as time progresses and shows trends over time. This is a very common use of data.
2.13
ggplot(data = early_january_2023_weather, mapping = aes(x = time_hour, y = visib)) +
  geom_line()
2.14 It shows that the distribution of wind speeds between roughly 5 and 12 mph is much more even than the 30-bin plot suggested.
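For anyone reproducing the comparison, the number of bins is controlled by the bins argument of geom_histogram(). In this sketch n_bins is a hypothetical value chosen for illustration, not necessarily the one used for the answer above.

```r
library(ggplot2)
library(nycflights23)

# n_bins is an illustrative choice; change it to compare binnings
n_bins <- 20

p_hist <- ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(bins = n_bins, color = "white")
p_hist
```

Re-running with different values of n_bins shows how much of the apparent shape depends on the binning.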
2.15 I would describe it as skewed somewhat to the right.
2.16 I would guess the “center” value to be about 10 mph. Most of the data is concentrated around 10, and the bin containing 10 is the tallest one in the 5-bin plot.
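The visual guess of the center can be compared against numerical summaries. This is a sketch only; it assumes wind_speed may contain missing values, which is why na.rm = TRUE is used.

```r
library(nycflights23)

# Two common measures of center for wind speed, ignoring missing values
wind_median <- median(weather$wind_speed, na.rm = TRUE)
wind_mean   <- mean(weather$wind_speed, na.rm = TRUE)
c(median = wind_median, mean = wind_mean)
```

For a right-skewed distribution the mean typically sits above the median, so comparing the two also speaks to the skew described in 2.15.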
2.17 I would describe it as somewhat close to the center. The highest bins are all close to the center and there are very few, if any, outliers.
2.18 I notice that each month has a similar distribution of data, and the bin at x=10 is always the peak. I also notice that some months have up to 6 bins while others have only 4. A faceted plot helps us visualize how specific factors can affect the data we’re measuring. It can also show how data changes over time.
2.19 The numbers 1 through 12 represent the months of the year. The numbers 10, 20, and 30 represent wind speeds in miles per hour.
2.20 Faceted plots would not work well when the variable you are using to split up the data has a lot of entries. For example, splitting up by days would result in 365 facets which would not be very useful. Faceted plots would also offer very little value when you’re splitting up by a variable with continuous values. For example, splitting up by pressure would produce about as many facet plots as data points.
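To make the argument in 2.20 concrete, one can count how many facets each variable would generate. A minimal sketch, with n_month and n_pressure as illustrative names:

```r
library(nycflights23)

# Number of panels a facet on month would create
n_month <- length(unique(weather$month))
n_month

# Number of panels a facet on pressure would create
# (a missing value, if present, counts as one extra level here)
n_pressure <- length(unique(weather$pressure))
n_pressure
```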
2.21 No. This is because there are so many data points in the weather dataset and wind_speed falls within a pretty consistent range throughout the dataset. It is also consistently concentrated around the same speeds.
2.22 The dots correspond to outliers in the data. These could be explained by storms or unusual weather patterns that brought stronger than normal winds.
2.23 February and March, followed by November and December, seem to have the highest variability in wind speed. These months may experience the strongest weather events or may see more extreme seasonal changes throughout the course of the month.
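The month-by-month variability read off the boxplots can be checked numerically with the interquartile range, which is the height of each box. This is a sketch under the assumption that missing wind speeds should simply be dropped.

```r
library(nycflights23)

# IQR of wind speed within each month; na.rm is passed through to IQR()
iqr_by_month <- tapply(weather$wind_speed, weather$month, IQR, na.rm = TRUE)

# Months sorted from most to least variable by this measure
sort(iqr_by_month, decreasing = TRUE)
```

The IQR ignores the outlying dots, so a month with many extreme gusts but a narrow box can still rank low here.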
2.24 pressure does not fall into a small set of categories the way month does. It is a continuous variable with decimal values, so it takes many distinct values, and converting it with factor() would produce too many boxplots to be helpful.
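By contrast, month works well with factor() because it has only a handful of values. A minimal sketch of the side-by-side boxplot this refers to, with p_box as an illustrative name:

```r
library(ggplot2)
library(nycflights23)

# factor(month) turns the numeric month into a categorical variable,
# giving one box per month
p_box <- ggplot(data = weather,
                mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot()
p_box
```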
2.25 Outliers present themselves as individual dots in boxplots. This makes them very easy to point out. However, in histograms, they show up in bins and it is unclear what their exact value is or how many data points might be in that bin if the y-axis has a relatively large scale.
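The dots in a boxplot can also be listed explicitly. The sketch below uses base R's boxplot.stats(), which applies roughly the same 1.5 × IQR rule; note that it pools all months together, so the flagged values will not match the per-month boxplots exactly.

```r
library(nycflights23)

# Values flagged as outliers by the 1.5 * IQR rule (missing values are ignored)
wind_outliers <- boxplot.stats(weather$wind_speed)$out

# How many observations are flagged, and the largest few of them
length(wind_outliers)
head(sort(wind_outliers, decreasing = TRUE))
```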