Lab 2: Introduction to Data

In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013.

We will generate simple graphical and numerical summaries of data on these flights and explore delay times. You can find the instructions for this lab here

# 1. load the library "tidyverse"
library("tidyverse")

# 2. use the read_csv function to read the dataset
nycflights <- read_csv("data/nycflights.csv")

Exercise 1

Question: Experiment with different binwidths to adjust your histogram in a way that will show its important features. You may also want to try using the + scale_x_log10(). What features are revealed now that were obscured in the original histogram?

# Write your code to create a histogram of delays

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 150)

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 15) + 
  scale_x_log10()

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 19936 rows containing non-finite values (`stat_bin()`).

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 0.1) + 
  scale_x_log10()

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 19936 rows containing non-finite values (`stat_bin()`).

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 0.015) + 
  scale_x_log10()

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 19936 rows containing non-finite values (`stat_bin()`).

ANSWER: The bin width is the range of x values (here, minutes of departure delay) that each bar covers. A smaller bin width shows finer detail in the distribution, while a very large one (such as 150) lumps almost all flights into a few bars. The + scale_x_log10() function puts the x axis on a base-10 logarithmic scale, which stretches small values and squeezes large values together. Because the logarithm is undefined for zero and negative numbers, the 19936 flights with a departure delay of zero or less are dropped, as the warnings show. On the log scale the bin width is measured in log10 units, so it makes sense to decrease it to 0.1 or below; a width of 15 puts essentially all remaining flights into one bar.
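The warnings above can be reproduced on a handful of made-up delay values. This is only a minimal sketch: the vector `delays` is hypothetical and not taken from nycflights. It shows which values survive a log10 transformation and what a bin width of 0.1 means on that scale.

```r
# Hypothetical departure delays in minutes (not from nycflights)
delays <- c(-5, 0, 2, 10, 100, 1000)

# log10 gives NaN for negative numbers and -Inf for zero
log_delays <- suppressWarnings(log10(delays))

# Only strictly positive delays stay finite; non-finite values are dropped from the plot
n_dropped <- sum(!is.finite(log_delays))
n_dropped

# On the log10 scale, binwidth is measured in log10 units:
# a bin of width 0.1 spans a factor of 10^0.1 in minutes
bin_factor <- 10^0.1
bin_factor
```

In this toy vector the two non-positive delays are the ones that drop out, which mirrors the "Removed ... rows containing non-finite values" warning.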

Exercise 2

Question: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

# Insert code for Exercise 2 here 
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

# Count the flights that meet both criteria
nrow(sfo_feb_flights)

# Example 1: We use this code to visualize only the flights with the destination Los Angeles.
lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Example 2: We can also compute summary statistics per group, here the median and IQR of departure delay and the number of flights for each origin airport:
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())

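ANSWER: The number of flights from NYC to SFO in February is the number of rows in sfo_feb_flights, i.e. nrow(sfo_feb_flights). The value 32735 is the number of rows in the full nycflights sample, not in the filtered data frame.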

Exercise 3

Question: Describe the distribution of the arrival delays of flights headed to SFO in February, using an appropriate histogram and summary statistics.

# Insert code for Exercise 3 here 
sfo_feb_flights %>% 
  summarise(median_arr = median(arr_delay), iqr_arr = IQR(arr_delay), n_flights = n())

ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ANSWER: The flights are usually on time, if not a few minutes early (typically 11 minutes early, to be specific). Only very few February flights headed to SFO arrive late. But when a plane is late, it is usually very late (100 to 200 minutes), so the distribution has a long right tail.
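A distribution with a few very late flights is the reason the median and IQR are used here rather than the mean. The following sketch uses made-up arrival delays (the vector `toy_arr_delay` is hypothetical, not from sfo_feb_flights) to show how a single late flight pulls the mean while the median stays put.

```r
# Hypothetical arrival delays in minutes: mostly early, one very late flight
toy_arr_delay <- c(-20, -15, -11, -11, -5, 0, 150)

median_arr <- median(toy_arr_delay)  # robust to the single late flight
mean_arr   <- mean(toy_arr_delay)    # pulled upward by the late flight
iqr_arr    <- IQR(toy_arr_delay)     # spread of the middle half of the data

median_arr
mean_arr
iqr_arr
```

In this toy data the median is negative (an early arrival) while the mean is positive, which is the typical signature of a right-skewed distribution.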

Exercise 4

Question: Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

# Insert code for the Exercise here
sfo_feb_flights %>% 
  group_by(carrier) %>%
  summarise(median_arr = median(arr_delay), iqr_arr = IQR(arr_delay), n_flights = n())

ANSWER: The carriers with the most variable arrival delays, as measured by the IQR, are B6 and DL. They are closely followed by VX.
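Rather than reading the largest IQR off the summary table, the table can be sorted so that the most variable carrier comes first. This is a sketch on made-up data: the data frame `toy_delays` and the carrier codes C1 to C3 are hypothetical.

```r
library(dplyr)

# Hypothetical arrival delays for three made-up carriers
toy_delays <- data.frame(
  carrier   = c("C1", "C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"),
  arr_delay = c(-5, 0, 5, -30, 0, 30, -1, 0, 1)
)

toy_spread <- toy_delays %>%
  group_by(carrier) %>%
  summarise(iqr_arr = IQR(arr_delay)) %>%
  arrange(desc(iqr_arr))   # largest IQR first

# The first row is the carrier with the most variable delays
most_variable <- as.character(toy_spread$carrier[1])
most_variable
```

The same arrange(desc(iqr_arr)) step could be appended to the summarise() call used for sfo_feb_flights.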

Exercise 5

Question: Create a list of origin airports and their rate of on-time departure. Then visualize the distribution of on-time departure rate across the three airports using a segmented bar plot (see below). If you could select an airport based on on-time departure percentage, which NYC airport would you choose to fly out of? Hint: For the segmented bar plot, you will need to map the aesthetic arguments as follows: x = origin, fill = dep_type and a geom_bar() layer. Create three plots, one with a geom_bar() layer, one with geom_bar(position = "fill") and the third with geom_bar(position = "dodge"). Explain the difference between the three results.

# Insert code for the Exercise here
# Example: How to compute the mean departure delay for each month
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))

# classify flights with a departure delay below 5 minutes as "on time" and all others as "delayed" in a new variable
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

# gives a descending list of on-time departure rates for each origin
nycflights %>%
  group_by(origin) %>%
  summarise(on_time_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(on_time_dep_rate))

# visualisation 1
nycflights %>%
  ggplot(aes(x = origin, fill = dep_type)) + 
  geom_bar()

# visualisation 2 (shows proportions and relabels the legend and the y-axis)
nycflights %>%
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  labs(fill = "departure type", y = "proportion")

# visualisation 3 (places the bars side by side)
nycflights %>%
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar(position = "dodge")

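ANSWER: We would choose to fly from LGA, as it has the highest proportion of on-time departures. The three plots show the same data in different ways: geom_bar() stacks the counts of on-time and delayed flights, so the bar height reflects the total number of flights per airport; geom_bar(position = "fill") scales every bar to a height of 1, so the segments show proportions and the airports can be compared directly; geom_bar(position = "dodge") places the on-time and delayed counts side by side.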
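The numbers behind the three bar plots can also be computed directly. The sketch below uses a made-up data frame (`toy`, with hypothetical airports A and B) to show the counts that geom_bar() and position = "dodge" draw, and the within-airport proportions that position = "fill" draws.

```r
library(dplyr)

# Hypothetical flights from two made-up airports
toy <- data.frame(
  origin   = c("A", "A", "A", "A", "B", "B"),
  dep_type = c("on time", "on time", "on time", "delayed", "on time", "delayed")
)

toy_props <- toy %>%
  count(origin, dep_type) %>%    # counts: heights for stacked and dodged bars
  group_by(origin) %>%
  mutate(prop = n / sum(n)) %>%  # proportions: segment heights for position = "fill"
  ungroup()

toy_props
```

Airport A has more flights in total, which only the count-based plots reveal, while the proportions make the on-time rates of A and B directly comparable.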

Exercise 6

Question: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph, or if you are brave, in km/h). Now make a scatter plot of distance vs. avg_speed. Think carefully which of the two variables is the predictor (on the x-axis) and which is the outcome variable (on the y-axis) and explain why you made this choice. Describe the relationship between average speed and distance.

# Insert code for the Exercise here

# air_time is in minutes, so multiply by 60 to get miles per hour
nycflights <- nycflights %>% 
  mutate(avg_speed = (distance / air_time) * 60)

nycflights %>%
  ggplot(aes(x = distance, y = avg_speed)) + 
    geom_point() + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

nycflights %>%
  ggplot(aes(x = distance, y = avg_speed)) + 
    geom_point() + geom_smooth() + scale_x_log10()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

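ANSWER: Distance is the predictor (x-axis) because it is fixed by the route before the flight takes off, and average speed is the outcome (y-axis). The longer the distance, the higher the average speed of the plane tends to be.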
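For the km/h variant mentioned in the question, the mph value only needs one extra conversion factor. The following sketch works through the arithmetic for a single made-up flight; the variable names and the example values are hypothetical.

```r
# One international mile is exactly 1.609344 km
km_per_mile <- 1.609344

# Hypothetical flight: 500 miles flown in 60 minutes of air time
distance_miles <- 500
air_time_min   <- 60

# Miles per minute times 60 gives miles per hour
avg_speed_mph <- distance_miles / air_time_min * 60

# Convert miles per hour to kilometres per hour
avg_speed_kmh <- avg_speed_mph * km_per_mile

avg_speed_mph
avg_speed_kmh
```

Applied to the data frame, the same factor could be used inside mutate(), for example as an additional column next to avg_speed.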

Exercise 7

Question: Replicate the following plot and determine what is the cutoff point for the latest departure delay where you can still have a chance to arrive at your destination on time.

# Insert code for the Exercise here
nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA")) %>%
  filter(arr_delay <= 0) %>%
  ggplot(aes(x = dep_delay, y = arr_delay, color = carrier)) +
    geom_point() + scale_x_continuous(breaks = -25:100)

ANSWER: The latest one is a Delta Airlines flight with a departure delay between 63 and 64 minutes (eyeballed from the plot).
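Instead of eyeballing the plot, the cutoff can also be computed by keeping only the flights that arrived on time and picking the one with the largest departure delay. This is a sketch on made-up data: the data frame `toy` and its values are hypothetical, not taken from nycflights.

```r
library(dplyr)

# Hypothetical flights (values are made up)
toy <- data.frame(
  carrier   = c("AA", "DL", "UA", "DL", "AA"),
  dep_delay = c(-3, 40, 12, 75, 20),
  arr_delay = c(-10, -2, 0, 30, 5)
)

latest_on_time <- toy %>%
  filter(arr_delay <= 0) %>%     # arrived on time or early
  slice_max(dep_delay, n = 1)    # largest departure delay among those flights

latest_on_time
```

Running the same two steps on the filtered nycflights data would return the exact row behind the rightmost point in the plot.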