library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
How delayed were flights that were headed to Los Angeles? How do departure delays vary by month? Which of the three major NYC airports has the best on time percentage for departing flights?
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
The three histograms differ only in their binwidth. With a binwidth of 150, we lose most of the detail but still get a sense of the general distribution. Note that the counts on the y-axis change with the binwidth, because wider bins collect more flights each.
With a binwidth of 15, we still see the general distribution, but in more detail than with a binwidth of 150: the flights are split across more bins, so more of the bins have counts below 5,000.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
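To see why the counts on the y-axis depend on the binwidth, here is a minimal sketch on a handful of made-up delays (not values from nycflights). It uses base R's hist() with plot = FALSE only to count observations per bin; the break points are chosen for illustration.

```r
# Toy departure delays in minutes (made-up values, not from nycflights)
delays <- c(-5, -2, 0, 3, 8, 12, 14, 20, 33, 47, 95, 140)

# Count observations per bin for a narrow and a wide bin width
narrow <- hist(delays, breaks = seq(-15, 150, by = 15), plot = FALSE)
wide   <- hist(delays, breaks = seq(-150, 150, by = 150), plot = FALSE)

narrow$counts  # eleven bins, each with a small count
wide$counts    # two bins, each with a larger count
```

Both versions count the same 12 observations; only the way they are grouped changes, which is why the tallest bar grows as the bins get wider.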
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
There are 68 records of flights going to SFO in February.
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
summarise(sfo_feb_flights, n())
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
The arrival delays of flights going to SFO in February are roughly bell-shaped, with a small number of outliers on the right.
Because the tail (the outliers) is on the right, the distribution is skewed right rather than left, so the median and IQR describe it better than the mean and standard deviation.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 15)
sfo_feb_flights %>%
summarise(mean_ad = mean(arr_delay),
median_ad = median(arr_delay),
n = n(),
minimum_ad = min(arr_delay),
maximum_ad = max(arr_delay))
Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
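Variability is measured by spread, so the carrier with the most variable arrival delays is the one with the largest iqr_ad in the summary below, which is not necessarily the one with the largest median. The medians tell a different story. Delta Airlines, United Airlines and Virgin America all have more flights than AA, and their medians are negative, so at least 50% of their flights are early or on time. The opposite holds for American Airlines: at least 50% of its flights are late. That makes AA the most typically delayed carrier for flights going to SFO in February, which is a statement about the center of the distribution rather than its variability.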
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_ad = median(arr_delay),
iqr_ad = IQR(arr_delay),
n_flights = n())
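The difference between a late carrier and a variable carrier is easier to see on a small example. The sketch below uses two hypothetical carriers with made-up arrival delays (not values from sfo_feb_flights): one is consistently a few minutes late, the other is on time in the middle but widely spread.

```r
library(dplyr)

# Made-up arrival delays for two hypothetical carriers
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  arr_delay = c(4, 5, 6, 7, -40, -10, 10, 40)
)

# Median measures the center, IQR measures the spread
spread <- toy %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay),
            iqr_ad    = IQR(arr_delay)) %>%
  arrange(desc(iqr_ad))  # most variable carrier first

spread
```

In this toy data the carrier with the higher median is not the one with the higher IQR, so sorting by iqr_ad is what answers a question about variability.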
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
This question boils down to when to use the mean and when to use the median. The median is the better choice when there are clear outliers that would distort the arithmetic mean; otherwise, use the mean.
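As a quick illustration of how a single outlier affects the two statistics, here is a sketch with made-up departure delays for one hypothetical month (not values from nycflights).

```r
# Made-up departure delays (minutes) for one hypothetical month
typical      <- c(-3, -1, 0, 2, 4, 5, 7)
with_outlier <- c(typical, 600)  # add one flight delayed by ten hours

mean(typical)          # 2
median(typical)        # 2

mean(with_outlier)     # 76.75
median(with_outlier)   # 3
```

One extreme delay moves the mean by more than an hour but barely moves the median. The mean is still informative if the rare, very long delays are exactly what we want to avoid.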
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
If we’re basing the decision solely on on-time departure percentage, then the choice would be LaGuardia Airport (LGA). If we look at the bar chart of counts, one might lean towards EWR, but EWR has a larger number of flights overall, so its percentage of on-time flights is actually lower, as shown in the tibble below.
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
ggplot(nycflights, aes(x=origin, fill = dep_type)) +
geom_bar()
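A bar chart of counts can make the busiest airport look like the most punctual one. The sketch below shows the difference between counts and rates on made-up data for two origins (not values from nycflights), using the same 5-minute threshold as above. Taking mean() of a logical vector is an equivalent way to write sum(...) / n().

```r
library(dplyr)

# Made-up departure delays for two origins
toy <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "EWR", "LGA", "LGA"),
  dep_delay = c(-2, 10, 30, 1, 0, 45)
)

rates <- toy %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(n_on_time   = sum(dep_type == "on time"),   # raw count
            n_flights   = n(),
            ot_dep_rate = mean(dep_type == "on time"))  # proportion

rates
```

Here the first origin has twice as many on-time flights as the second, yet both have the same on-time rate, so the comparison has to be made on the rate.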
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
nycflights <- nycflights %>%
mutate(avg_speed = distance/(air_time/60))
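To check the unit conversion, here is the same formula worked through for a single hypothetical flight (the numbers are made up): 90 minutes is 1.5 hours, so 300 miles in that time is 200 mph.

```r
# One hypothetical flight (made-up values)
distance <- 300  # miles
air_time <- 90   # minutes

hours     <- air_time / 60     # 1.5 hours
avg_speed <- distance / hours  # miles per hour

avg_speed  # 200
```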
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
Average speed and distance appear to have a positive relationship: flights that cover longer distances tend to have higher average speeds.
ggplot(nycflights, aes(x = distance, y = avg_speed)) +
geom_point()
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
specific_flights <- nycflights %>%
filter(carrier %in% c("AA", "DL", "UA"))
ggplot(specific_flights, aes(x=dep_delay, y=arr_delay, color = carrier)) +
geom_point()
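One rough way to read a cutoff off data like this, instead of judging it by eye from the scatterplot, is to group departure delays into bins and compute the share of flights in each bin that still arrived on time. The sketch below does this on made-up delays (not values from nycflights). The bin edges and the definition of "on time" as arr_delay <= 0 are assumptions made for illustration.

```r
library(dplyr)

# Made-up departure and arrival delays in minutes
toy <- data.frame(
  dep_delay = c(-5, 0, 3, 8, 12, 18, 25, 40),
  arr_delay = c(-20, -12, -6, -1, 2, 9, 20, 38)
)

cutoff <- toy %>%
  # Assumed bin edges; adjust to the range of the real data
  mutate(delay_bin = cut(dep_delay, breaks = c(-Inf, 0, 10, 20, Inf))) %>%
  group_by(delay_bin) %>%
  # Assumed definition: on time means arr_delay <= 0
  summarise(share_on_time = mean(arr_delay <= 0))

cutoff
```

The cutoff is roughly where share_on_time drops off. In this toy data that happens between the (0,10] and (10,20] bins.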
# An attempt to find the cutoff point for departure delays where we can still expect to get to our destination on time.
# The idea: subtracting dep_time from arr_time should give the time spent in the air, which we can check against the existing air_time column.
# We then subtract both delays (some values are negative, which accounts for flights that leave or arrive early).
# If the final value is positive, the delays didn't interfere with the arrival, which would help us find a cutoff point.
# This doesn't work as written: dep_time and arr_time are clock times stored as HHMM numbers (e.g. 1530), so subtracting them does not give elapsed minutes.
nycflights %>%
mutate(on_time = arr_time - dep_time - air_time - dep_delay - arr_delay)
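The arithmetic on dep_time and arr_time goes wrong because clock times like 945 and 1015 are not counts of minutes. A minimal sketch of a conversion follows. hhmm_to_minutes is a hypothetical helper, not a function from openintro or the tidyverse, and it ignores time zones and flights that land after midnight.

```r
# Hypothetical helper: convert an HHMM clock time (e.g. 1530 for 3:30 pm)
# to minutes since midnight. Ignores time zones and overnight arrivals.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

hhmm_to_minutes(c(517, 1530, 2359))  # 317 930 1439

# A made-up flight that leaves at 9:45 and lands at 10:15
naive   <- 1015 - 945                                    # 70, not a duration
elapsed <- hhmm_to_minutes(1015) - hhmm_to_minutes(945)  # 30 minutes
```

With the times converted this way, the difference can be compared with air_time in the same units, although the destination's time zone would still need to be accounted for.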