library(tidyverse)
library(openintro)
library(ggplot2)

Exercise 1

All three histograms are right-skewed. In the first histogram, the first bin has the highest count, and the counts decrease with each bin. Using more bins than the default of 30 (that is, a smaller bin width) would provide more information. In the second histogram, the second bin has the highest count. The first bin shows negative values, which represent early departures. Since more bins are present, this histogram shows that some departures were early, most departures had a short delay, and the rest had longer delays, but those longer delays were rare. When comparing these first two histograms, the second histogram shows more information, and it is a better representation of the data. In the third histogram, the bin width is much too large, but it still shows right-skewed data, and some useful information can be taken from it. Overall, the second histogram is the best visualization of the departure delay variable because it displays the most features of the data.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

Exercise 2

68 flights meet the criteria of heading to SFO in February.

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
glimpse(sfo_feb_flights)
## Rows: 68
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ day       <int> 18, 3, 15, 18, 24, 25, 7, 15, 13, 8, 11, 13, 25, 20, 12, 27,…
## $ dep_time  <int> 1527, 613, 955, 1928, 1340, 1415, 1032, 1805, 1056, 656, 191…
## $ dep_delay <dbl> 57, 14, -5, 15, 2, -10, 1, 20, -4, -4, 40, -2, -1, -6, -7, 2…
## $ arr_time  <int> 1903, 1008, 1313, 2239, 1644, 1737, 1352, 2122, 1412, 1039, …
## $ arr_delay <dbl> 48, 38, -28, -6, -21, -13, -10, 2, -13, -6, 2, -5, -30, -22,…
## $ carrier   <chr> "DL", "UA", "DL", "UA", "UA", "UA", "B6", "AA", "UA", "DL", …
## $ tailnum   <chr> "N711ZX", "N502UA", "N717TW", "N24212", "N76269", "N532UA", …
## $ flight    <int> 1322, 691, 1765, 1214, 1111, 394, 641, 177, 642, 1865, 272, …
## $ origin    <chr> "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "JFK", "JFK", "JFK…
## $ dest      <chr> "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO…
## $ air_time  <dbl> 358, 367, 338, 353, 341, 355, 359, 338, 347, 361, 332, 351, …
## $ distance  <dbl> 2586, 2586, 2586, 2565, 2565, 2586, 2586, 2586, 2586, 2586, …
## $ hour      <dbl> 15, 6, 9, 19, 13, 14, 10, 18, 10, 6, 19, 8, 10, 18, 7, 17, 1…
## $ minute    <dbl> 27, 13, 55, 28, 40, 15, 32, 5, 56, 56, 10, 33, 48, 49, 23, 2…

Exercise 3

The distribution of arrival delays is right-skewed with one peak. The data are centered around -10 minutes, the range is approximately -120 to 185 minutes, and the data contain a few potential outliers.

library(ggeasy)
ggplot(sfo_feb_flights, aes(arr_delay)) +
  geom_histogram(binwidth = 8) +
  labs(title = "NYC Flights Arrival Delay") +
  xlab("Arrival Delay (in minutes)") +
  ylab("Count") +
  ggeasy::easy_center_title()

Exercise 4

Delta Air Lines (DL) and United Air Lines (UA) tied for the highest interquartile range of arrival delays, at 22 minutes each. Delta had slightly fewer flights (19 versus 21), so one could read the same spread over fewer flights as a sign of slightly more variability.

However, the IQR describes the spread of the middle half of each carrier's delays rather than a total that accumulates with the number of flights, and the counts differ by only two, so the two carriers had nearly the same variability.

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_dd = median(arr_delay), iqr_dd = IQR(arr_delay), n_flights = n())
## # A tibble: 5 × 4
##   carrier median_dd iqr_dd n_flights
##   <chr>       <dbl>  <dbl>     <int>
## 1 AA            5     17.5        10
## 2 B6          -10.5   12.2         6
## 3 DL          -15     22          19
## 4 UA          -10     22          21
## 5 VX          -22.5   21.2        12

Exercise 5

Before computing either statistic, the trade-offs can be weighed as follows. One benefit of choosing the month with the lowest mean is that the mean is the usual measure of center: it gives the average departure delay, and a month with a low average delay is likely to have low delays in the future. One drawback is that the data are not symmetric (they are right-skewed), so the mean, which is pulled upward by a few long delays, might not represent the center well. One benefit of choosing the month with the lowest median is that the median is robust when the data are skewed, so a low median indicates that a typical future departure delay will be near that value. One drawback of the median is that different months have different numbers of flights, and those differences might affect it. Another drawback is that the monthly medians are all extremely similar, so the differences between months seem insignificant.

However, the months with the lowest mean and the lowest median were both determined below, and October came out lowest on both. September tied with October for the lowest median (-3 minutes), but September did not have the lowest mean. Therefore, the choice of statistic does not matter in this case: October had the lowest mean (5.88 minutes) and shared the lowest median departure delay.

ggplot(nycflights, aes(dep_delay)) +
  geom_histogram(binwidth = 15) +
  labs(title = "NYC Flights Departure Delay") +
  xlab("Departure Delay (in minutes)") +
  ylab("Count") +
  ggeasy::easy_center_title()

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(mean_dd)
## # A tibble: 12 × 2
##    month mean_dd
##    <int>   <dbl>
##  1    10    5.88
##  2    11    6.10
##  3     9    6.87
##  4     1   10.2 
##  5     2   10.7 
##  6     8   12.6 
##  7     5   13.3 
##  8     3   13.5 
##  9     4   14.6 
## 10    12   17.4 
## 11     6   20.4 
## 12     7   20.8
nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(median_dd)
## # A tibble: 12 × 2
##    month median_dd
##    <int>     <dbl>
##  1     9        -3
##  2    10        -3
##  3     1        -2
##  4     2        -2
##  5     4        -2
##  6    11        -2
##  7     3        -1
##  8     5        -1
##  9     8        -1
## 10     6         0
## 11     7         0
## 12    12         1
nrow(nycflights |> filter(month == 1))
## [1] 2610
nrow(nycflights |> filter(month == 2))
## [1] 2286
nrow(nycflights |> filter(month == 3))
## [1] 2869
nrow(nycflights |> filter(month == 4))
## [1] 2781
nrow(nycflights |> filter(month == 5))
## [1] 2821
nrow(nycflights |> filter(month == 6))
## [1] 2732
nrow(nycflights |> filter(month == 7))
## [1] 2742
nrow(nycflights |> filter(month == 8))
## [1] 2880
nrow(nycflights |> filter(month == 9))
## [1] 2681
nrow(nycflights |> filter(month == 10))
## [1] 2884
nrow(nycflights |> filter(month == 11))
## [1] 2733
nrow(nycflights |> filter(month == 12))
## [1] 2716

Exercise 6

La Guardia (LGA) had the highest on-time departure percentage (about 72.8%), so I would choose to fly out of there.

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637
library(nycflights13)
airports %>%
  filter(faa == 'LGA')
## # A tibble: 1 × 8
##   faa   name         lat   lon   alt    tz dst   tzone           
##   <chr> <chr>      <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
## 1 LGA   La Guardia  40.8 -73.9    22    -5 A     America/New_York

Exercise 7

The following code adds a new variable, avg_speed, to the nycflights data frame; it holds each flight's average speed in miles per hour. Table 1 previews the average speed for a few of the flights.

nycflights <- mutate(nycflights, avg_speed = distance/(air_time/60))
head(nycflights$avg_speed)
## [1] 474.4409 443.8889 394.9468 446.6667 355.2000 318.6957
library(gt)
## 
## Attaching package: 'gt'
## The following object is masked from 'package:openintro':
## 
##     sp500
nycflights |>
  select(avg_speed) |>
  gt_preview() |>
  tab_header(title = "Table 1",
             subtitle = "Preview of Average Speed (mph)")
Table 1: Preview of Average Speed (mph)

row        avg_speed
1          474.4409
2          443.8889
3          394.9468
4          446.6667
5          355.2000
6..32734   (rows omitted)
32735      410.8475

Exercise 8

Average speed and distance have a positive, non-linear relationship, as seen in the graph below. In general, flights with higher average speeds covered longer distances, and distance rose increasingly steeply at the higher speeds.

ggplot(nycflights, aes(x = avg_speed, y = distance, color = month)) +
  geom_point() +
  labs(title = "Average Speed vs. Distance of NYC Flights") +
  xlab("Average Speed (in mph)") +
  ylab("Distance (in miles)") +
  ggeasy::easy_center_title()

Exercise 9

The cutoff is a departure delay of about 60 minutes: up to that point one can still expect some flights to arrive on time, but beyond roughly a 60-minute departure delay none of the flights arrived on time.

---
title: "Lab 2: Intro to Data"
author: "Julia Ferris"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(ggplot2)
```

### Exercise 1

All three histograms are right-skewed. In the first histogram, the first bin has the highest count, and the counts decrease with each bin. Using more bins than the default of 30 (that is, a smaller bin width) would provide more information. In the second histogram, the second bin has the highest count. The first bin shows negative values, which represent early departures. Since more bins are present, this histogram shows that some departures were early, most departures had a short delay, and the rest had longer delays, but those longer delays were rare. When comparing these first two histograms, the second histogram shows more information, and it is a better representation of the data. In the third histogram, the bin width is much too large, but it still shows right-skewed data, and some useful information can be taken from it. Overall, the second histogram is the best visualization of the departure delay variable because it displays the most features of the data.

```{r comparing-histograms}
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
```
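
The effect of bin width can also be checked numerically. The following is a minimal sketch in base R, assuming a small made-up vector of delays (`toy_delays`) and a hypothetical helper `count_bins()`; neither is part of the lab, and ggplot2 may align its bin edges differently.

```r
# Hypothetical departure delays in minutes (not the nycflights data).
toy_delays <- c(-8, -5, -3, -2, 0, 1, 4, 9, 17, 33, 62, 140)

# Hypothetical helper: count observations per left-closed bin of a given width.
count_bins <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width + width,
                by = width)
  table(cut(x, breaks = breaks, right = FALSE))
}

narrow <- count_bins(toy_delays, 15)   # many bins: early vs. short vs. long delays
wide <- count_bins(toy_delays, 150)    # few bins: only the overall skew survives

print(narrow)
print(wide)
```

With the wide bins, the early departures and the short delays can no longer be told apart from the rest, which is the loss of detail described for the third histogram.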


### Exercise 2

68 flights meet the criteria of heading to SFO in February.

```{r sfo-feb-flights}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
glimpse(sfo_feb_flights)
```
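
Conditions separated by commas in `filter()` are combined with a logical AND. A minimal base R sketch of the same idea, using a hypothetical `toy_flights` data frame rather than the lab data:

```r
# Hypothetical data frame (not the nycflights data).
toy_flights <- data.frame(
  dest = c("SFO", "SFO", "LAX", "SFO", "ORD"),
  month = c(2, 3, 2, 2, 2)
)

# Keep rows where BOTH conditions hold, mirroring filter(dest == "SFO", month == 2).
sfo_feb_toy <- toy_flights[toy_flights$dest == "SFO" & toy_flights$month == 2, ]

# The number of rows is the number of flights that meet the criteria.
n_matches <- nrow(sfo_feb_toy)
print(n_matches)
```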


### Exercise 3

The distribution of arrival delays is right-skewed with one peak. The data are centered around -10 minutes, the range is approximately -120 to 185 minutes, and the data contain a few potential outliers.

```{r arrival-delays}
library(ggeasy)
ggplot(sfo_feb_flights, aes(arr_delay)) +
  geom_histogram(binwidth = 8) +
  labs(title = "NYC Flights Arrival Delay") +
  xlab("Arrival Delay (in minutes)") +
  ylab("Count") +
  ggeasy::easy_center_title()
```


### Exercise 4

Delta Air Lines (DL) and United Air Lines (UA) tied for the highest interquartile range of arrival delays, at 22 minutes each. Delta had slightly fewer flights (19 versus 21), so one could read the same spread over fewer flights as a sign of slightly more variability.

However, the IQR describes the spread of the middle half of each carrier's delays rather than a total that accumulates with the number of flights, and the counts differ by only two, so the two carriers had nearly the same variability.

```{r feb_flights-statistics}
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_dd = median(arr_delay), iqr_dd = IQR(arr_delay), n_flights = n())
```
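
As a rough check on how the IQR behaves, the sketch below uses made-up vectors (all names are hypothetical, not lab data). It assumes R's default quantile method and shows that repeating the same values, or adding one extreme value, leaves the IQR unchanged in this toy case, while the standard deviation reacts strongly to the extreme value.

```r
# Hypothetical delay values (not the nycflights data).
small_group <- c(1, 2, 3, 4, 5)
large_group <- rep(small_group, times = 2)  # same values, twice as many "flights"
with_outlier <- c(1, 2, 3, 4, 50)           # one extreme delay

iqr_small <- IQR(small_group)
iqr_large <- IQR(large_group)
iqr_outlier <- IQR(with_outlier)

# The standard deviation, by contrast, is sensitive to the extreme value.
sd_small <- sd(small_group)
sd_outlier <- sd(with_outlier)

print(c(iqr_small, iqr_large, iqr_outlier))
print(c(sd_small, sd_outlier))
```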


### Exercise 5

Before computing either statistic, the trade-offs can be weighed as follows. One benefit of choosing the month with the lowest mean is that the mean is the usual measure of center: it gives the average departure delay, and a month with a low average delay is likely to have low delays in the future. One drawback is that the data are not symmetric (they are right-skewed), so the mean, which is pulled upward by a few long delays, might not represent the center well. One benefit of choosing the month with the lowest median is that the median is robust when the data are skewed, so a low median indicates that a typical future departure delay will be near that value. One drawback of the median is that different months have different numbers of flights, and those differences might affect it. Another drawback is that the monthly medians are all extremely similar, so the differences between months seem insignificant.

However, the months with the lowest mean and the lowest median were both determined below, and October came out lowest on both. September tied with October for the lowest median (-3 minutes), but September did not have the lowest mean. Therefore, the choice of statistic does not matter in this case: October had the lowest mean (5.88 minutes) and shared the lowest median departure delay.

```{r departure-delays}
ggplot(nycflights, aes(dep_delay)) +
  geom_histogram(binwidth = 15) +
  labs(title = "NYC Flights Departure Delay") +
  xlab("Departure Delay (in minutes)") +
  ylab("Count") +
  ggeasy::easy_center_title()

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(mean_dd)

nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(median_dd)

nrow(nycflights |> filter(month == 1))
nrow(nycflights |> filter(month == 2))
nrow(nycflights |> filter(month == 3))
nrow(nycflights |> filter(month == 4))
nrow(nycflights |> filter(month == 5))
nrow(nycflights |> filter(month == 6))
nrow(nycflights |> filter(month == 7))
nrow(nycflights |> filter(month == 8))
nrow(nycflights |> filter(month == 9))
nrow(nycflights |> filter(month == 10))
nrow(nycflights |> filter(month == 11))
nrow(nycflights |> filter(month == 12))
```
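
The difference between the mean and the median under right skew can be illustrated with a small made-up example. This is only a sketch: `toy_delays` is a hypothetical vector, not a sample from the lab data.

```r
# Hypothetical departure delays in minutes: mostly near zero, one very long delay.
toy_delays <- c(-5, -3, -2, -1, 0, 2, 4, 180)

mean_all <- mean(toy_delays)      # pulled upward by the single 180-minute delay
median_all <- median(toy_delays)  # stays near the typical flight

# Dropping the long delay changes the mean a lot and the median very little.
trimmed <- toy_delays[toy_delays < 180]
mean_trimmed <- mean(trimmed)
median_trimmed <- median(trimmed)

print(c(mean_all, median_all))
print(c(mean_trimmed, median_trimmed))
```

In this toy case a single long delay moves the mean by more than 20 minutes, while the median shifts by half a minute, which is why the median is called robust.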


### Exercise 6

La Guardia (LGA) had the highest on-time departure percentage (about 72.8%), so I would choose to fly out of there.

```{r best-airport}
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

library(nycflights13)
airports %>%
  filter(faa == 'LGA')
```
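
The on-time rate is the share of flights per airport whose delay falls under the five-minute threshold. A minimal base R sketch of that calculation, assuming a hypothetical `toy_flights` data frame with made-up airport codes:

```r
# Hypothetical data frame (airport codes and delays are made up).
toy_flights <- data.frame(
  origin = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB"),
  dep_delay = c(-3, 2, 10, 4, 30, -1, 5, 45)
)

# Same rule as in the lab: a delay under 5 minutes counts as on time.
toy_flights$dep_type <- ifelse(toy_flights$dep_delay < 5, "on time", "delayed")

# The mean of a logical vector is the proportion of TRUE values,
# which is equivalent to sum(dep_type == "on time") / n() per group.
ot_dep_rate <- tapply(toy_flights$dep_type == "on time", toy_flights$origin, mean)
ot_dep_rate <- sort(ot_dep_rate, decreasing = TRUE)

print(ot_dep_rate)
```

Note that a delay of exactly 5 minutes counts as delayed under this rule, because the comparison is strict.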


### Exercise 7

The following code adds a new variable, `avg_speed`, to the `nycflights` data frame; it holds each flight's average speed in miles per hour. Table 1 previews the average speed for a few of the flights.

```{r new-variable}
nycflights <- mutate(nycflights, avg_speed = distance/(air_time/60))
head(nycflights$avg_speed)

library(gt)
nycflights |>
  select(avg_speed) |>
  gt_preview() |>
  tab_header(title = "Table 1",
             subtitle = "Preview of Average Speed (mph)")
```
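
The unit conversion behind `avg_speed` can be checked by hand: air time is recorded in minutes, so it is divided by 60 before dividing the distance by it. A small sketch, assuming a hypothetical helper `avg_speed_mph()` and illustrative values of 2,586 miles and 358 minutes:

```r
# Hypothetical helper: average speed in mph from miles and minutes.
avg_speed_mph <- function(distance_miles, air_time_min) {
  hours <- air_time_min / 60   # convert minutes to hours
  distance_miles / hours
}

# Illustrative values: a 2,586-mile flight with 358 minutes in the air.
speed <- avg_speed_mph(2586, 358)
print(round(speed, 1))
```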

### Exercise 8

Average speed and distance have a positive, non-linear relationship, as seen in the graph below. In general, flights with higher average speeds covered longer distances, and distance rose increasingly steeply at the higher speeds.

```{r}
ggplot(nycflights, aes(x = avg_speed, y = distance, color = month)) +
  geom_point() +
  labs(title = "Average Speed vs. Distance of NYC Flights") +
  xlab("Average Speed (in mph)") +
  ylab("Distance (in miles)") +
  ggeasy::easy_center_title()
```

### Exercise 9

The cutoff is a departure delay of about 60 minutes: up to that point one can still expect some flights to arrive on time, but beyond roughly a 60-minute departure delay none of the flights arrived on time.

```{r plot-to-replicate, echo=FALSE, fig.show="asis", fig.width=7, fig.height=4}
dl_aa_ua <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
ggplot(data = dl_aa_ua, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()
```
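
A cutoff like this can also be estimated numerically instead of read off the scatterplot. The sketch below is one possible approach, not the method used in the lab: it assumes that "on time" means an arrival delay of zero or less, and it uses a hypothetical `toy` data frame with made-up values.

```r
# Hypothetical flights (values are made up for illustration).
toy <- data.frame(
  dep_delay = c(-5, 0, 10, 25, 40, 58, 70, 95, 120),
  arr_delay = c(-12, -3, -1, 5, -2, 0, 15, 60, 110)
)

# Assumption: a flight arrives on time when its arrival delay is <= 0.
on_time <- toy$arr_delay <= 0

# The cutoff is the largest departure delay among flights that still arrived on time.
cutoff <- max(toy$dep_delay[on_time])

# Every flight that left later than the cutoff arrived late.
all_late_after <- all(toy$arr_delay[toy$dep_delay > cutoff] > 0)

print(cutoff)
print(all_late_after)
```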

