DATA606WK2Lab - Introduction to data

Load packages

library(tidyverse)
library(openintro)

Load data

data(nycflights)

Exercise 1

Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 30)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

The first two histograms use bin widths (150 and 30) that are wide enough to hide some detail. The third histogram (bin width 15) shows more clearly that a number of flights left early, that is, with a negative departure delay.
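To make the effect of bin width concrete, here is a minimal base R sketch that counts the same values at two bin widths. The delay values are made up for illustration (they are not taken from nycflights), `bin_counts` is a hypothetical helper, and the bin edges will not line up exactly with the ones ggplot2 chooses.

```r
# Toy departure delays in minutes (made-up values, not from nycflights)
delays <- c(-8, -5, -3, -2, -1, 0, 2, 4, 12, 20, 35, 140)

# Hypothetical helper: count observations per left-closed bin of a given width
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width + width,
                by = width)
  table(cut(x, breaks = breaks, right = FALSE))
}

wide   <- bin_counts(delays, 150)  # a few very coarse bins
narrow <- bin_counts(delays, 15)   # many bins; the long right tail becomes visible

print(wide)
print(narrow)
```

With the wide bins nearly everything lands in two bars, while the narrow bins separate the cluster near zero from the isolated late flight.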

Exercise 2

Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

68 flights meet these criteria.

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
nrow(sfo_feb_flights)

## [1] 68

Exercise 3

Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

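The distribution is skewed right with outliers, so the mean is pulled toward the long tail and the median is the better measure of center (with the IQR as the matching measure of spread). The quantile function returns the full five-number summary.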

# Histogram showing the right skew
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 10)

# Several statistics for reference; only the quantiles shown next are needed
sfo_feb_flights %>%
  summarise(mean_ad   = mean(arr_delay), 
            median_ad = median(arr_delay), 
            n         = n(),
            sd_ad     = sd(arr_delay),
            var_ad    = var(arr_delay),
            iqr_ad    = IQR(arr_delay),
            min_ad    = min(arr_delay),
            max_ad    = max(arr_delay))

## # A tibble: 1 × 8
##   mean_ad median_ad     n sd_ad var_ad iqr_ad min_ad max_ad
##     <dbl>     <dbl> <int> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1    -4.5       -11    68  36.3  1316.   23.2    -66    196

#Just the quantiles
quantile(sfo_feb_flights$arr_delay)

##     0%    25%    50%    75%   100% 
## -66.00 -21.25 -11.00   2.00 196.00
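As a quick cross-check, the IQR can be recomputed from the quartiles printed above. The sketch below copies those five values by hand. The 1.5 × IQR fences are a common rule of thumb for flagging outliers; they are an assumption added here and not part of the lab text.

```r
# Quartiles copied from the quantile() output above
q <- c(`0%` = -66.00, `25%` = -21.25, `50%` = -11.00, `75%` = 2.00, `100%` = 196.00)

# IQR is the distance between the third and first quartiles
iqr_from_quartiles <- unname(q["75%"] - q["25%"])

# Rule-of-thumb fences (assumption: the usual 1.5 * IQR convention)
lower_fence <- unname(q["25%"] - 1.5 * iqr_from_quartiles)
upper_fence <- unname(q["75%"] + 1.5 * iqr_from_quartiles)

print(iqr_from_quartiles)
print(c(lower = lower_fence, upper = upper_fence))

# Is the maximum arrival delay beyond the upper fence?
print(unname(q["100%"]) > upper_fence)
```

The recomputed IQR of 23.25 agrees with the iqr_ad value of 23.2 in the summary table, which is shown to three significant digits.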

Exercise 4

Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

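Delta (DL) and United (UA) are tied for the most variable arrival delays, each with an IQR of 22 minutes. DL’s figure comes from fewer flights (19 versus 21), so it rests on slightly less data, but the smaller sample does not by itself make DL’s delays more variable.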

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())

## # A tibble: 5 × 4
##   carrier median_ad iqr_ad n_flights
##   <chr>       <dbl>  <dbl>     <int>
## 1 AA            5     17.5        10
## 2 B6          -10.5   12.2         6
## 3 DL          -15     22          19
## 4 UA          -10     22          21
## 5 VX          -22.5   21.2        12

Exercise 5

Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

November is the best month! Its mean (6.10) and median (-2) departure delays are close to the lowest of any month, and it has the smallest standard deviation (27.6); October is marginally lower on mean (5.88) and median (-3) but slightly more variable. The mean can be influenced by outliers, or flukes in the data, whose occurrence may not be related to the month. The median resists those outliers, but here it gives little information because the monthly medians span only 4 minutes (-3 to 1). A helpful third measure is the standard deviation, since between two months with low medians and means we would want the one with the least variability. Since we’re only concerned about variability in one direction, would kurtosis be relevant?

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), sd_dd = sd(dep_delay)) %>%
  arrange(desc(median_dd))

## # A tibble: 12 × 4
##    month mean_dd median_dd sd_dd
##    <int>   <dbl>     <dbl> <dbl>
##  1    12   17.4          1  43.0
##  2     6   20.4          0  53.5
##  3     7   20.8          0  47.8
##  4     3   13.5         -1  40.3
##  5     5   13.3         -1  38.3
##  6     8   12.6         -1  39.2
##  7     1   10.2         -2  42.4
##  8     2   10.7         -2  33.1
##  9     4   14.6         -2  43.4
## 10    11    6.10        -2  27.6
## 11     9    6.87        -3  35.3
## 12    10    5.88        -3  29.4
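The sensitivity of the mean to outliers can be seen in a small base R sketch. The delay values are made up for illustration and are not drawn from nycflights.

```r
# Toy departure delays in minutes (made-up values)
delays <- c(-5, -3, -2, -1, 0, 1, 2, 4)

# The same delays plus one extreme late departure
with_outlier <- c(delays, 300)

# Compare how each statistic reacts to the single extreme value
comparison <- data.frame(
  data   = c("without outlier", "with outlier"),
  mean   = c(mean(delays), mean(with_outlier)),
  median = c(median(delays), median(with_outlier))
)
print(comparison)
```

A single 300-minute delay moves the mean by more than half an hour, while the median shifts by half a minute.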

Exercise 6

If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

LaGuardia (LGA), with an on-time departure rate of about 73%. The segmented bar plot below is consistent with this.

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()
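The on-time rate is the share of flights with a departure delay under 5 minutes, which is the same as the mean of a logical vector. Here is a minimal base R sketch on a made-up data frame (the rows are hypothetical, not from nycflights).

```r
# Toy flights (made-up rows); dep_delay is in minutes
flights <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  dep_delay = c(-2, 30, 4, 5, -1, 0, 12, 3)
)

# Same rule as the lab: strictly less than 5 minutes counts as on time
flights$on_time <- flights$dep_delay < 5

# The mean of TRUE/FALSE values is the proportion of TRUE values per airport
rate <- tapply(flights$on_time, flights$origin, mean)
print(sort(rate, decreasing = TRUE))
```

Because the comparison is strict, a delay of exactly 5 minutes is classified as delayed.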

Exercise 7

Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

Here we add the new column and pull up the first three rows to make sure we got our math right: miles/minute × 60 minutes/hour = miles/hour. The minutes cancel out.

nycflights <- nycflights %>%
  mutate(avg_speed = distance/air_time*60)

select(nycflights, distance, air_time, avg_speed) %>% head(3)

## # A tibble: 3 × 3
##   distance air_time avg_speed
##      <dbl>    <dbl>     <dbl>
## 1     2475      313      474.
## 2     1598      216      444.
## 3     2475      376      395.
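The three rows above can be checked by hand. This base R sketch copies the distance and air_time values from the table and repeats the conversion, dividing the minutes by 60 to get hours first.

```r
# Values copied from the first three rows printed above
distance <- c(2475, 1598, 2475)  # miles
air_time <- c(313, 216, 376)     # minutes

# Convert minutes to hours, then divide distance by hours to get mph
hours     <- air_time / 60
avg_speed <- distance / hours

print(round(avg_speed))
```

Rounded to whole numbers, the results match the avg_speed column shown in the tibble.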

Exercise 8

Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

With distance on the x-axis and average speed on the y-axis, any vertical band of points is likely flights to the same destination, and there’s variation in how fast the flights to the same location were, probably due to necessity, weather and weight on board. For short flights, average speed increases with distance; however, after ~1250 miles it looks like there’s a wall that planes don’t travel faster than. Maybe speed is lower for short flights because planes slow down and circle at takeoff and landing, or because shorter flights are planned with less fuel and a lower speed in mind.

ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
  geom_point()

Exercise 9

Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

The cutoff seems to be about 45 minutes: a flight that departs up to ~45 minutes late can still get to its destination on time. We could facet these results by distance and probably come up with a relationship between distance and how many minutes late a flight can depart and still have an on-time arrival. Original chart is below.

dl_aa_ua_ot <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"), arr_delay < 5)
ggplot(data = dl_aa_ua_ot, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point() +
  # Vertical reference line at the estimated 45-minute cutoff
  geom_vline(xintercept = 45, color = "purple", linetype = "dotted") +
  # annotate() draws the label once instead of once per row of data
  annotate("text", x = 25, y = -30, label = "Cut-off for on-time arrival", vjust = 0, hjust = 0)
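The visual estimate of the cutoff could also be checked numerically. The sketch below is one possible approach, not the lab's method: it takes the largest departure delay among flights that still arrived on time. The data frame is made up for illustration, and the on-time rule (arrival delay under 5 minutes) mirrors the filter used above.

```r
# Toy flights (made-up values); both delays are in minutes
flights <- data.frame(
  dep_delay = c(-3, 10, 25, 44, 50, 60),
  arr_delay = c(-20, -5, 2, 4, 20, 45)
)

# Keep only flights that arrived on time (arrival delay under 5 minutes)
on_time <- flights[flights$arr_delay < 5, ]

# Largest departure delay that still ended in an on-time arrival
cutoff <- max(on_time$dep_delay)
print(cutoff)
```

On real data a single lucky flight can inflate the maximum, so a high quantile such as `quantile(on_time$dep_delay, 0.99)` may give a steadier estimate.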