library(tidyverse)
library(openintro)
data(nycflights)
Exercise 1:
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
The three histograms differ in bin width, that is, the range of departure-delay values (in minutes) grouped into each bar. Histogram 1 uses the default of 30 bins, while histograms 2 and 3 set the width to 15 and 150 minutes. Because wider bins collect more flights per bar, the scale of the vertical y-axis also changes.
The narrower bins in histograms 1 and 2 show detail that the wide bins in histogram 3 hide. In this case that detail is important, particularly in histogram 2, where a number of records preceding the spike are visible. That information, hidden in histograms 1 and 3, helps to interpret these results.
# Histogram 1
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram 2
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)

# Histogram 3
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)

Exercise 2:
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
sfo_feb_flights %>%
summarise(mean_dd = mean(arr_delay),
median_dd = median(arr_delay),
n = n())
## # A tibble: 1 x 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 -4.5 -11 68
The n column of the summarise() output shows that 68 flights were headed to SFO in February.
Exercise 3:
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(sfo_feb_flights$arr_delay, skew = FALSE)
## vars n mean sd min max range se
## X1 1 68 -4.5 36.28 -66 196 262 4.4
The distribution of arrival delays looks roughly bell-shaped around its center but has a long right tail. A few flights with extreme delays (the maximum is 196 minutes) pull the mean (-4.5 minutes) above the median (-11 minutes). Because of that skew, the median and IQR summarize these flights better than the mean and standard deviation.
Exercise 4:
Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_dd = median(arr_delay), iqr_dd = IQR(arr_delay), n_flights = n()) %>%
arrange(desc(iqr_dd))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## carrier median_dd iqr_dd n_flights
## <chr> <dbl> <dbl> <int>
## 1 DL -15 22 19
## 2 UA -10 22 21
## 3 VX -22.5 21.2 12
## 4 AA 5 17.5 10
## 5 B6 -10.5 12.2 6
Based on the table above, Delta (DL) and United Airlines (UA) tie for the most variable arrival delays, each with an IQR of 22 minutes. (As someone who would normally travel quite a bit for work and who uses UA as my carrier, I can attest to this variance!)
Exercise 5:
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay)) %>%
arrange(mean_dd)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month mean_dd
## <int> <dbl>
## 1 10 5.88
## 2 11 6.10
## 3 9 6.87
## 4 1 10.2
## 5 2 10.7
## 6 8 12.6
## 7 5 13.3
## 8 3 13.5
## 9 4 14.6
## 10 12 17.4
## 11 6 20.4
## 12 7 20.8
nycflights %>%
group_by(month) %>%
summarise(median_dd = median(dep_delay)) %>%
arrange(median_dd)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month median_dd
## <int> <dbl>
## 1 9 -3
## 2 10 -3
## 3 1 -2
## 4 2 -2
## 5 4 -2
## 6 11 -2
## 7 3 -1
## 8 5 -1
## 9 8 -1
## 10 6 0
## 11 7 0
## 12 12 1
Grouping these data by month and sorting by each statistic shows that October has the lowest mean departure delay (5.9 minutes), while September and October tie for the lowest median (-3 minutes). The mean uses every flight, so it reflects the risk of a very long delay, but a few outliers can dominate it. As we’ve seen previously, there are some really large delays, whereas early departures are comparatively small. The median, by contrast, is resistant to those outliers: half of the flights in a month depart at or before it, so in September half of the flights left at least 3 minutes early. Airlines are not prone to leaving extraordinarily early, so the median, particularly with a value around zero minutes, would seem to give a more consistent picture of a typical departure. Its drawback is that it says nothing about how long the worst delays are.
Exercise 6:
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
nycOrigin <- nycflights %>%
filter(origin == "EWR" | origin == "JFK" | origin == "LGA")
nycOrigin$onTimeFac <- ifelse(nycOrigin$dep_delay <= 0, "On Time", "Late")
prop.table(table(nycOrigin$origin, nycOrigin$onTimeFac))*100
##
## Late On Time
## EWR 16.22422 19.73423
## JFK 12.70200 20.58653
## LGA 10.17260 20.58042
After subsetting the data to the three NYC-area airports, I created a new column that classifies each flight: any flight with a departure delay at or below zero minutes is labeled “On Time”; otherwise it is “Late.”
The proportion table by origin and that new column breaks out the on-time and late flights. The rows do not add up to 100% because prop.table(), called without a margin argument, divides every cell by the total number of flights, so it is the six cells together that sum to 100%. Dividing each “On Time” value by its row total gives the on-time departure rate per airport: about 54.9% for EWR, 61.8% for JFK, and 66.9% for LGA.
Based on these data, LGA has the highest on-time departure percentage, followed by JFK and then EWR.
Exercise 7:
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
nycflights <- nycflights %>%
mutate(avg_speed = distance / (air_time / 60))
Exercise 8:
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
ggplot(nycflights, aes(x=distance, y=avg_speed)) +
geom_point()

Exercise 9:
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
Ex9 <- nycflights %>%
filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
ggplot(Ex9, aes(x=dep_delay, y=arr_delay, color=carrier)) +
geom_point()

From the scatter plot above, I would put the cutoff at roughly an hour (60 minutes): flights that depart up to about that late can still arrive on time.
---
title: "Lab Week 2: Intro to Data"
author: "Ian Costello"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
data(nycflights)
```

### Exercise 1:

*Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?*

The three histograms differ in bin width, that is, the range of departure-delay values (in minutes) grouped into each bar. Histogram 1 uses the default of 30 bins, while histograms 2 and 3 set the width to 15 and 150 minutes. Because wider bins collect more flights per bar, the scale of the vertical y-axis also changes.

The narrower bins in histograms 1 and 2 show detail that the wide bins in histogram 3 hide. In this case that detail is important, particularly in histogram 2, where a number of records preceding the spike are visible. That information, hidden in histograms 1 and 3, helps to interpret these results.
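
As a rough illustration of why bin width changes both the detail and the y-axis scale, the sketch below counts a small set of made-up delays in 15-minute and 150-minute bins. The `delays` vector and the `count_by_bin()` helper are hypothetical, and the binning is a simplified stand-in rather than the exact rule `geom_histogram()` applies.

```r
# Hypothetical departure delays in minutes (not taken from nycflights)
delays <- c(-5, -3, -2, 0, 1, 4, 8, 12, 16, 25, 31, 47, 140)

# Count the values that fall into bins of a given width.
# Bins are left-closed, [lower, upper), and start at a multiple of `width`.
count_by_bin <- function(x, width) {
  lower <- floor(min(x) / width) * width
  upper <- (floor(max(x) / width) + 1) * width
  breaks <- seq(lower, upper, by = width)
  table(cut(x, breaks = breaks, right = FALSE))
}

narrow <- count_by_bin(delays, width = 15)   # many bins, small counts
wide   <- count_by_bin(delays, width = 150)  # few bins, large counts

narrow
wide
```

With the wide bins every non-negative delay lands in a single bar, so the tallest bar is higher and the shape within it is lost.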


```{r flights dist}

# Histogram 1
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

# Histogram 2
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

# Histogram 3
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
```

### Exercise 2:

*Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?*

```{r feb flights create}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

sfo_feb_flights %>%
  summarise(mean_dd   = mean(arr_delay), 
      median_dd       = median(arr_delay), 
      n               = n())
```
The n column of the summarise() output shows that 68 flights were headed to SFO in February.

### Exercise 3:

*Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.*

```{r feb sum viz}
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram()

library(psych)
describe(sfo_feb_flights$arr_delay, skew = FALSE)

```

The distribution of arrival delays looks roughly bell-shaped around its center but has a long right tail. A few flights with extreme delays (the maximum is 196 minutes) pull the mean (-4.5 minutes) above the median (-11 minutes). Because of that skew, the median and IQR summarize these flights better than the mean and standard deviation.
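
To make the effect of a single extreme delay concrete, here is a minimal sketch on made-up numbers. The vectors and the `summarise_delays()` helper are hypothetical; only the value 196 echoes the maximum reported above.

```r
# Hypothetical arrival delays in minutes
typical <- c(-30, -22, -15, -11, -8, -5, 0, 6)
# The same flights plus one extreme delay
with_outlier <- c(typical, 196)

# Collect the two pairs of summary statistics side by side
summarise_delays <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

before <- summarise_delays(typical)
after  <- summarise_delays(with_outlier)
round(rbind(before, after), 1)
```

Adding the one outlier moves the mean from about -10.6 to about 12.3 minutes, while the median only moves from -9.5 to -8. The standard deviation inflates in the same way, whereas the IQR barely changes.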

### Exercise 4:

*Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?*
```{r feb carrier summary}
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_dd = median(arr_delay), iqr_dd = IQR(arr_delay), n_flights = n()) %>%
  arrange(desc(iqr_dd))
```

Based on the table above, Delta (DL) and United Airlines (UA) tie for the most variable arrival delays, each with an IQR of 22 minutes. (As someone who would normally travel quite a bit for work and who uses UA as my carrier, I can attest to this variance!)
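
For readers less familiar with the IQR as a measure of spread, the following sketch compares two made-up carriers using base R. The carrier codes "XX" and "YY" and all delay values are hypothetical.

```r
# Hypothetical arrival delays (minutes) for two made-up carriers
arr_delay <- c(-20, -12, -10, -8, 0,    # carrier "XX": tightly clustered
               -40, -15, -5, 20, 45)    # carrier "YY": widely spread
carrier <- rep(c("XX", "YY"), each = 5)

# IQR = third quartile minus first quartile, computed per carrier
iqr_dd <- tapply(arr_delay, carrier, IQR)
sort(iqr_dd, decreasing = TRUE)
```

The carrier with the larger IQR has the wider middle half of its delays, which is the sense of "most variable" used in this exercise.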

### Exercise 5:

*Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?*

```{r nyc month delays}
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(mean_dd)

nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(median_dd)
```

Grouping these data by month and sorting by each statistic shows that October has the lowest mean departure delay (5.9 minutes), while September and October tie for the lowest median (-3 minutes). The mean uses every flight, so it reflects the risk of a very long delay, but a few outliers can dominate it. As we've seen previously, there are some really large delays, whereas early departures are comparatively small. The median, by contrast, is resistant to those outliers: half of the flights in a month depart at or before it, so in September half of the flights left at least 3 minutes early. Airlines are not prone to leaving extraordinarily early, so the median, particularly with a value around zero minutes, would seem to give a more consistent picture of a typical departure. Its drawback is that it says nothing about how long the worst delays are.
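
The two criteria can point to different months. The sketch below uses two made-up months, labeled "A" and "B", to show how that happens; none of these values come from nycflights.

```r
# Hypothetical departure delays (minutes) for two made-up months
delays <- data.frame(
  month = rep(c("A", "B"), each = 6),
  dep_delay = c(-4, -3, -3, -2, 0, 120,   # month A: usually early, one long delay
                 2,  3,  4,  5, 6,   8)   # month B: consistently a little late
)

mean_dd   <- tapply(delays$dep_delay, delays$month, mean)
median_dd <- tapply(delays$dep_delay, delays$month, median)

names(which.min(mean_dd))    # month with the lowest mean
names(which.min(median_dd))  # month with the lowest median
```

Month A wins on the median because most of its flights leave early, but its single long delay gives it the higher mean, so month B wins on the mean.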

### Exercise 6:

*If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?*

```{r nyc orgin subset}
nycOrigin <- nycflights %>%
  filter(origin == "EWR" | origin == "JFK" | origin == "LGA")

nycOrigin$onTimeFac <- ifelse(nycOrigin$dep_delay <= 0, "On Time", "Late")

prop.table(table(nycOrigin$origin, nycOrigin$onTimeFac))*100

```
After subsetting the data to the three NYC-area airports, I created a new column that classifies each flight: any flight with a departure delay at or below zero minutes is labeled "On Time"; otherwise it is "Late."

The proportion table by origin and that new column breaks out the on-time and late flights. The rows do not add up to 100% because prop.table(), called without a margin argument, divides every cell by the total number of flights, so it is the six cells together that sum to 100%. Dividing each "On Time" value by its row total gives the on-time departure rate per airport: about 54.9% for EWR, 61.8% for JFK, and 66.9% for LGA.

Based on these data, LGA has the highest on-time departure percentage, followed by JFK and then EWR.
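
A minimal sketch of the per-airport calculation, assuming the percentages printed by the chunk above are typed in by hand as a matrix (the object names `pct_of_total` and `pct_by_airport` are made up for this illustration). Passing `margin = 1` to `prop.table()` rescales within each row.

```r
# Percentages of all flights, copied from the table printed in this exercise
pct_of_total <- matrix(
  c(16.22422, 19.73423,
    12.70200, 20.58653,
    10.17260, 20.58042),
  nrow = 3, byrow = TRUE,
  dimnames = list(origin = c("EWR", "JFK", "LGA"),
                  status = c("Late", "On Time"))
)

# margin = 1 divides each cell by its row total, so every row sums to 100
pct_by_airport <- prop.table(pct_of_total, margin = 1) * 100
round(pct_by_airport, 1)
```

The same `margin = 1` argument could be added to the original `prop.table(table(...))` call to get row percentages directly from the flight counts.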

### Exercise 7:

*Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.*

```{r}
nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
```
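
Because air_time is recorded in minutes, dividing distance by it directly yields miles per minute, and the minutes have to be converted to hours to get mph. A quick check of the unit conversion on a made-up flight (the distance and air time below are hypothetical, chosen for round numbers):

```r
distance <- 2500   # miles (hypothetical)
air_time <- 300    # minutes (hypothetical)

miles_per_minute <- distance / air_time          # about 8.33, not mph
avg_speed_mph    <- distance / (air_time / 60)   # 300 minutes = 5 hours

avg_speed_mph
```

The two quantities differ by exactly a factor of 60, which is an easy sanity check on any avg_speed column.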


### Exercise 8:

*Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().*

```{r}
ggplot(nycflights, aes(x=distance, y=avg_speed)) +
  geom_point()
```

### Exercise 9:

*Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.*

```{r}
Ex9 <- nycflights %>%
  filter(carrier == "AA" | carrier == "DL" | carrier == "UA")

ggplot(Ex9, aes(x=dep_delay, y=arr_delay, color=carrier)) +
  geom_point()
```

From the scatter plot above, I would put the cutoff at roughly an hour (60 minutes): flights that depart up to about that late can still arrive on time.