library(tidyverse)
library(openintro)
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
?nycflights
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
All three histograms show the overall distribution of departure delays, but the wider the bins, the less detail is visible. With a bin width of 150 minutes, nearly all of the flights fall into a single bin. This does show that very few flights are severely delayed, but little else. The default histogram tells a similar story with somewhat more detail. The histogram with a bin width of 15 minutes is the most detailed of the three, and it reveals the most about flights that depart close to their scheduled time.
lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()
lax_flights %>%
  summarise(mean_dd = mean(dep_delay),
            median_dd = median(dep_delay),
            n = n())
Note that you can separate the conditions using commas if you want flights that are both headed to SFO and in February. If you are interested in either flights headed to SFO or in February, you can use the | instead of the comma.
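As a small illustration of the difference between the comma and |, the sketch below uses a made-up five-row data frame (toy_flights is hypothetical and is not part of nycflights) so that the row counts are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data, not taken from nycflights
toy_flights <- data.frame(
  dest  = c("SFO", "SFO", "LAX", "ORD", "SFO"),
  month = c(2, 7, 2, 7, 2)
)

# Comma: both conditions must hold (headed to SFO AND in February)
both <- toy_flights %>%
  filter(dest == "SFO", month == 2)

# |: at least one condition must hold (headed to SFO OR in February)
either <- toy_flights %>%
  filter(dest == "SFO" | month == 2)

nrow(both)
nrow(either)
```

The comma version keeps only the rows that satisfy both conditions, so it can never return more rows than the | version.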
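Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
## Answer: 68 flights
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)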
nrow(sfo_feb_flights)
## [1] 68
Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can modify the above command using the group_by function to get the same summary stats for each origin airport.
# Answer 3: I used the IQR because the arrival delays are widely spread, even though the distribution is moderately symmetric.
ggplot(sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 5) +
  ggtitle("Arrival Delays Distribution") +
  theme(plot.title = element_text(hjust = 0.5))
sfo_feb_flights %>%
  summarise(mean_arr_delay = mean(arr_delay),
            iqr_ad = IQR(arr_delay),
            n_flights = n())
Here, we first grouped the data by origin and then calculated the summary statistics.
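A grouped summary of this kind might look like the following sketch. It is only an illustration: toy_flights is a made-up data frame standing in for sfo_feb_flights, and the column names median_dd, iqr_dd, and n_flights are assumptions chosen to mirror the summaries used elsewhere in this lab.

```r
library(dplyr)

# Hypothetical toy data standing in for sfo_feb_flights
toy_flights <- data.frame(
  origin    = c("JFK", "JFK", "JFK", "EWR", "EWR", "EWR"),
  dep_delay = c(-2, 4, 10, 0, 5, 40)
)

# Group by origin first, then compute one row of statistics per group
by_origin <- toy_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay),
            iqr_dd    = IQR(dep_delay),
            n_flights = n())

by_origin
```

Because group_by comes before summarise, the result has one row per origin airport rather than a single row for the whole data frame.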
Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
United and Delta have the most variable arrival delays: they tie for the highest IQR of arrival delays, at 22 minutes.
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_arr_delay = median(arr_delay),
            iqr_ad = IQR(arr_delay),
            n_flights = n())
Which month would you expect to have the highest average delay departing from an NYC airport?
Let’s think about how you could answer this question:
group_by months, then
summarise mean departure delays, and
arrange these average delays in descending order.
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate. Also suppose that for you, a flight that is delayed for less than 5 minutes is basically “on time.” You consider any flight delayed for 5 minutes or more to be “delayed”.
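In order to determine which airport has the best on time departure rate, you can first classify each flight as “on time” or “delayed”, then group flights by origin airport, then calculate the on time departure rate for each origin airport, and finally arrange the airports in descending order of on time departure rate.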
Let’s start with classifying each flight as “on time” or “delayed” by creating a new variable with the mutate function.
# Answer: LaGuardia has the highest on time percentage.
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarize(on_time_dep = sum(dep_type == 'on time') / n()) %>%
  arrange(desc(on_time_dep))
The first argument in the mutate function is the name of the new variable we want to create, in this case dep_type. Then if dep_delay < 5, we classify the flight as "on time" and "delayed" if not, i.e. if the flight is delayed for 5 or more minutes.
Note that we are also overwriting the nycflights data frame with the new version of this data frame that includes the new dep_type variable.
We can handle all of the remaining steps in one code chunk:
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
You can also visualize the distribution of on time departure rate across the three airports using a segmented bar plot.
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()
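Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
nycflights <- nycflights %>%
  # speed (mph) = distance (miles) / time (hours); air_time is in minutes
  mutate(avg_speed = distance / (air_time / 60))
head(nycflights)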
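As a quick check of the formula in the hint, the first flight in the glimpse output above covers 2,475 miles in 313 minutes of air time. A minimal sketch of the calculation for that one flight (the variable names here are only for illustration):

```r
# Values for the first flight shown in the glimpse() output
distance_miles <- 2475
air_time_min   <- 313

# Convert minutes to hours, then divide distance by hours
avg_speed_mph <- distance_miles / (air_time_min / 60)
avg_speed_mph
```

The result is roughly 474 mph, a plausible cruising speed for a transcontinental flight, which suggests the units are being handled correctly.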
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
ggplot(nycflights, aes(x = distance, y = avg_speed)) +
  geom_point(size = 1, color = "red") +
  ggtitle("Speed Vs. Distance") +
  theme(plot.title = element_text(hjust = 0.5))
9. Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
Flights that depart within 30 minutes of their scheduled time can reasonably be expected to have a 30-40% chance of making up the lost time. Flights that depart within 5 minutes of schedule, early or late, have the best chance of arriving on time. A hard cutoff is difficult to determine, but a departure delay of about 30 minutes is the most reasonable limit for still hoping to arrive on time.
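One way to build a plot of this kind is sketched below. It rests on several assumptions: that the plot shows arr_delay against dep_delay (the exercise text does not state the axes), that the three airlines correspond to the carrier codes AA, DL, and UA, and that a made-up data frame (toy_flights, hypothetical values) can stand in for nycflights so the example runs on its own. The name dl_aa_ua is also just an illustrative choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for nycflights
toy_flights <- data.frame(
  carrier   = c("AA", "DL", "UA", "B6", "AA", "WN"),
  dep_delay = c(-3, 12, 45, 5, 0, 20),
  arr_delay = c(-10, 2, 38, 9, -5, 15)
)

# Keep only American, Delta, and United flights
dl_aa_ua <- toy_flights %>%
  filter(carrier %in% c("AA", "DL", "UA"))

# Scatterplot of arrival delay against departure delay, colored by carrier
p <- ggplot(data = dl_aa_ua, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()
p
```

With the real data, replacing toy_flights with nycflights should give a comparable plot. The rough cutoff can then be read off as the departure delay beyond which few points fall at or below an arrival delay of zero.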