library(tidyverse)
library(openintro)

Exercise 1

Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

Histograms 1 and 2, which use smaller bin widths, show the data in finer detail than histogram 3. Histogram 3, with a bin width of 150, gives only a general picture of the data. To use a staircase analogy, histograms 1 and 2 have more “steps” than histogram 3. Features are clearly revealed in the first two histograms that the third obscures: for example, histograms 1 and 2 show a couple of smaller ‘surges’ after the initial surge, whereas in histogram 3 the data are bunched into a few wide bars and these surges cannot be seen.

lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
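
To see why a wide bin hides detail, here is a minimal sketch in base R on made-up delays. The vector `toy_delay` and the helper `count_bins()` are hypothetical and not part of the lab, and ggplot2 may align its bin edges differently, so this only illustrates the idea of counting observations per bin.

```r
# Hypothetical departure delays in minutes; not taken from nycflights
toy_delay <- c(-10, -5, 0, 3, 8, 20, 45, 60, 130, 290)

# Count observations per bin for a given bin width using base R cut()
count_bins <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  table(cut(x, breaks = breaks, include.lowest = TRUE))
}

narrow <- count_bins(toy_delay, 15)   # many narrow bins
wide   <- count_bins(toy_delay, 150)  # a few wide bins

length(narrow)  # number of bins with width 15
length(wide)    # number of bins with width 150
```

With the wide bins, most observations fall into the same bar, so any secondary bumps disappear.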

Exercise 2

Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

68 flights meet the criteria of being headed to SFO in February. We can confirm this quickly with the glimpse() function, which reports 68 rows.

sfo_feb_flights <- nycflights %>%
  filter(month == 2, dest == "SFO")

glimpse(sfo_feb_flights)
## Rows: 68
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ month     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ day       <int> 18, 3, 15, 18, 24, 25, 7, 15, 13, 8, 11, 13, 25, 20, 12, ...
## $ dep_time  <int> 1527, 613, 955, 1928, 1340, 1415, 1032, 1805, 1056, 656, ...
## $ dep_delay <dbl> 57, 14, -5, 15, 2, -10, 1, 20, -4, -4, 40, -2, -1, -6, -7...
## $ arr_time  <int> 1903, 1008, 1313, 2239, 1644, 1737, 1352, 2122, 1412, 103...
## $ arr_delay <dbl> 48, 38, -28, -6, -21, -13, -10, 2, -13, -6, 2, -5, -30, -...
## $ carrier   <chr> "DL", "UA", "DL", "UA", "UA", "UA", "B6", "AA", "UA", "DL...
## $ tailnum   <chr> "N711ZX", "N502UA", "N717TW", "N24212", "N76269", "N532UA...
## $ flight    <int> 1322, 691, 1765, 1214, 1111, 394, 641, 177, 642, 1865, 27...
## $ origin    <chr> "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "JFK", "JFK", "...
## $ dest      <chr> "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "...
## $ air_time  <dbl> 358, 367, 338, 353, 341, 355, 359, 338, 347, 361, 332, 35...
## $ distance  <dbl> 2586, 2586, 2586, 2565, 2565, 2586, 2586, 2586, 2586, 258...
## $ hour      <dbl> 15, 6, 9, 19, 13, 14, 10, 18, 10, 6, 19, 8, 10, 18, 7, 17...
## $ minute    <dbl> 27, 13, 55, 28, 40, 15, 32, 5, 56, 56, 10, 33, 48, 49, 23...

Exercise 3

Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

The histogram and summary statistics show that most of these flights arrived early: the median arrival delay is -11 minutes and the third quartile is only 2 minutes. The mean (-4.50) is noticeably higher than the median (-11), which points to a right-skewed distribution. A small number of flights with very long delays, up to the maximum of 196 minutes, pull the mean upward. Because of this skew, the median and the quartiles (-21.25 and 2) describe the distribution better than the mean.

ggplot(data = sfo_feb_flights, mapping = aes(x = arr_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(sfo_feb_flights$arr_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -66.00  -21.25  -11.00   -4.50    2.00  196.00
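
As a quick check on the numbers above, the spread of the middle half of the flights and the gap between mean and median can be worked out directly from the printed summary. This is only arithmetic on the reported values; the variable names are illustrative.

```r
# Values copied from the summary() output above
q1  <- -21.25
med <- -11
avg <- -4.5
q3  <- 2

iqr <- q3 - q1    # width of the middle 50% of arrival delays
gap <- avg - med  # a positive gap means the mean sits above the median

iqr
gap
```

A mean above the median, together with a maximum far from the third quartile, is consistent with a long right tail.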

Exercise 4

Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

Delta (DL) and United Airlines (UA) have the most variable arrival delays: they share the highest IQR, 22 minutes.

The interquartile range (IQR) is a measure of variability based on dividing a data set into quartiles: it is the difference between the third and first quartiles.

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_dd = median(arr_delay), iqr_dd = IQR(arr_delay), n_flights = n()) %>%
  arrange(desc(median_dd))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   carrier median_dd iqr_dd n_flights
##   <chr>       <dbl>  <dbl>     <int>
## 1 AA            5     17.5        10
## 2 UA          -10     22          21
## 3 B6          -10.5   12.2         6
## 4 DL          -15     22          19
## 5 VX          -22.5   21.2        12
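
The grouped IQR calculation can also be sketched in base R. The data frame `toy` and its carrier codes `X1` and `Y2` are made up for illustration; `tapply()` stands in for the `group_by()` and `summarise()` pipeline used above.

```r
# Hypothetical arrival delays (minutes) for two made-up carriers
toy <- data.frame(
  carrier   = c("X1", "X1", "X1", "X1", "Y2", "Y2", "Y2", "Y2"),
  arr_delay = c(-10, -5, 0, 5, -40, -10, 20, 60)
)

# IQR per carrier
iqr_by_carrier <- tapply(toy$arr_delay, toy$carrier, IQR)

# IQR() is the 75th percentile minus the 25th percentile
x1 <- toy$arr_delay[toy$carrier == "X1"]
manual_x1 <- unname(diff(quantile(x1, c(0.25, 0.75))))

iqr_by_carrier
most_variable <- names(which.max(iqr_by_carrier))
most_variable
```

Note that `which.max()` returns only the first maximum, so a tie like the one between DL and UA would need to be checked by eye or with `iqr_by_carrier == max(iqr_by_carrier)`.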

Exercise 5

Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

October (month 10) has the lowest mean departure delay, and September (month 9) and October are tied for the lowest median at -3 minutes. The mean is the average: all the values added together and divided by the number of values. The median is the middle value of the sorted values. Extreme departure delays heavily influence the mean, so choosing by mean takes the risk of very long delays into account but can be distorted by a few unusual flights. The median is not affected by extremes, so it gives a better idea of how a typical flight departs, but it ignores how bad the worst delays can be.

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
##    month mean_dd
##    <int>   <dbl>
##  1     7   20.8 
##  2     6   20.4 
##  3    12   17.4 
##  4     4   14.6 
##  5     3   13.5 
##  6     5   13.3 
##  7     8   12.6 
##  8     2   10.7 
##  9     1   10.2 
## 10     9    6.87
## 11    11    6.10
## 12    10    5.88
nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(desc(median_dd))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
##    month median_dd
##    <int>     <dbl>
##  1    12         1
##  2     6         0
##  3     7         0
##  4     3        -1
##  5     5        -1
##  6     8        -1
##  7     1        -2
##  8     2        -2
##  9     4        -2
## 10    11        -2
## 11     9        -3
## 12    10        -3
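
The different sensitivity of the two statistics can be seen in a minimal sketch on made-up delays. The vectors `typical` and `with_outlier` are hypothetical and only illustrate how one extreme flight moves the mean but barely moves the median.

```r
# Hypothetical departure delays in minutes
typical      <- c(-5, -3, -2, 0, 1, 2, 4)
with_outlier <- c(typical, 300)  # add one very late flight

mean_shift   <- mean(with_outlier) - mean(typical)
median_shift <- median(with_outlier) - median(typical)

mean_shift    # large change in the mean
median_shift  # small change in the median
```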

Exercise 6

If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

Based on on-time departure percentage, LGA would be the best pick, with about 72.8% of its flights departing on time.

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()
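
The on-time rate is just the share of flights whose delay is under the 5-minute threshold. A small sketch on made-up delays (the vector `toy_dep_delay` is hypothetical) shows that counting labels and averaging the logical condition give the same rate.

```r
# Hypothetical departure delays in minutes
toy_dep_delay <- c(-3, 0, 4, 5, 12, 47, -1, 2)

# Same labeling rule as in the mutate() call above
dep_type <- ifelse(toy_dep_delay < 5, "on time", "delayed")

# Two equivalent ways to compute the on-time rate
rate_sum  <- sum(dep_type == "on time") / length(dep_type)
rate_mean <- mean(toy_dep_delay < 5)

rate_sum
rate_mean
```

A delay of exactly 5 minutes counts as delayed under this rule, because the comparison is strict.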

Exercise 7

Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

As the hint says, average speed is distance divided by the number of hours of travel. Since air_time is given in minutes, we divide it by 60 to convert it to hours.

nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
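
As a worked example of the formula, take the first flight shown in the glimpse() output of sfo_feb_flights earlier (distance 2586, air_time 358). The variable names below are illustrative.

```r
# Values from the first row of the glimpse() output for sfo_feb_flights
distance_miles <- 2586
air_time_min   <- 358

air_time_hr <- air_time_min / 60           # minutes to hours
avg_speed   <- distance_miles / air_time_hr  # miles per hour

round(avg_speed, 1)
```

This comes out to roughly 433 mph for that flight.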

Exercise 8

Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

Average speed appears to increase with distance. Beyond about 2,000 miles, however, it levels off at roughly 500 to 600 mph.

ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
  geom_point()

Exercise 9

Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

Estimating from the plot, the cutoff for departure delays where you can still expect to reach your destination on time is around 45 minutes.

ua_aa_dl_flights <- nycflights %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL")

ggplot(data = ua_aa_dl_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()
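
The three-way carrier condition can also be written with `%in%`. A short sketch on a made-up vector (`toy_carrier` is hypothetical) checks that both forms select the same elements when there are no missing values.

```r
# Hypothetical carrier codes
toy_carrier <- c("UA", "B6", "AA", "VX", "DL", "UA")

# Chained comparisons, as in the filter() call above
keep_or <- toy_carrier == "UA" | toy_carrier == "AA" | toy_carrier == "DL"

# Equivalent membership test
keep_in <- toy_carrier %in% c("UA", "AA", "DL")

identical(keep_or, keep_in)
toy_carrier[keep_in]
```

The `%in%` form scales better if more carriers need to be added later.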

