library(tidyverse)
library(openintro)
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ~
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10~
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, ~
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940~
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, ~
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, ~
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, ~
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", ~
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", ~
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, ~
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA~
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA~
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,~
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,~
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6~
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24~
Histograms show where the data are dense. A smaller binwidth shows the shape of the distribution in finer detail, which makes it easier to describe, although very narrow bins can also make the plot noisy.
Yes, as the data are split into more bins, features that wider bins hide are revealed.
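The effect of bin width can be checked directly by drawing the same variable at several widths. The following is a minimal sketch, assuming the departure delays in nycflights are the variable of interest; the bin widths 15 and 150 and the names p_narrow and p_wide are illustrative choices, not values taken from the text above.

```r
library(tidyverse)
library(openintro)
data(nycflights)

# Narrow bins: more detail, but also more bar-to-bar noise
p_narrow <- ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

# Wide bins: smoother outline, but most flights fall into very few bars
p_wide <- ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

print(p_narrow)
print(p_wide)
```

Comparing the two plots side by side shows which features survive at each width.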
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
nrow(sfo_feb_flights)
## [1] 68
68 flights headed to SFO in February.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 10)
The distribution of the arrival delays of SFO flights is unimodal and right skewed with a long tail to the right.
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_arrdelay = median(arr_delay), iqr_arrdelay = IQR(arr_delay), n_flights = n())
Carrier VX had the most variable delay values. UA had the most flights; note that n_flights counts flights, not delays.
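Sorting the summary makes these comparisons easier to read off than scanning the table by eye. This is a sketch that assumes the same SFO and February filter as above; carrier_summary is a name introduced here for illustration.

```r
library(tidyverse)
library(openintro)
data(nycflights)

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

carrier_summary <- sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(iqr_arrdelay = IQR(arr_delay), n_flights = n())

# Largest IQR first: carriers with the most variable arrival delays
# appear at the top (ties are possible)
carrier_summary %>% arrange(desc(iqr_arrdelay))

# Most flights first: n_flights counts flights, not delayed flights
carrier_summary %>% arrange(desc(n_flights))
```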
monthly_data <- nycflights %>%
  group_by(month) %>%
  summarise(median_depdelay = median(dep_delay), mean_depdelay = mean(dep_delay),
            iqr_depdelay = IQR(dep_delay), minimum = min(dep_delay), maximum = max(dep_delay),
            # maximum - minimum is the range of the delays, not their variance
            range_depdelay = maximum - minimum, n_flights = n())
Let's first group the flights by month and compare the median departure delay with the mean departure delay. The number of flights is roughly the same in every month, so no month's summary rests on a much smaller sample than the others. The mean uses every delay, including the rare very long ones, so it reflects the delay a traveler should expect on average, and it is the measure of central tendency used here. Based on monthly_data, October appears to be the best month to travel.
If the departure delays within a month were heavily skewed, the median would be the better measure of central tendency, because it is not pulled upward by a few extreme delays.
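A small worked example shows how differently the two measures react to a single extreme value. The delays vector below is hypothetical and is not drawn from nycflights.

```r
# Hypothetical departure delays in minutes; one flight is delayed by 3 hours
delays <- c(-3, -1, 0, 2, 4, 180)

mean(delays)    # 30.33..., pulled upward by the single 180-minute delay
median(delays)  # 1, unaffected by how extreme the largest value is
```

Five of the six flights left within a few minutes of schedule, which the median reflects and the mean does not.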
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
LGA would be the preferred NYC airport based on the punctuality of the departures.
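The on-time rates can also be compared visually. The following is a sketch using a filled bar plot, assuming the same 5-minute cutoff for "on time" as above; p_ontime is a name introduced here for illustration.

```r
library(tidyverse)
library(openintro)
data(nycflights)

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

# position = "fill" scales every bar to 1, so the segments show proportions
p_ontime <- ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of departures")

print(p_ontime)
```

The airport with the tallest "on time" segment is the one with the highest on-time departure rate.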
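# air_time is in minutes, so avg_speed is in miles per minute
nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time)
# with() looks the columns up in nycflights without attach()-ing them to the search path
with(nycflights, plot(avg_speed, distance, main = "Scatterplot",
                      xlab = "Average Speed (miles per minute)", ylab = "Distance (miles)", pch = 19))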
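Because avg_speed divides distance by air_time, it is expressed in miles per minute. If miles per hour is preferred, the conversion can be sketched as follows; avg_speed_mph is a column name introduced here, and the units (distance in miles, air_time in minutes) are assumed from the dataset documentation.

```r
library(tidyverse)
library(openintro)
data(nycflights)

# Assumption: distance is in miles and air_time is in minutes,
# so air_time / 60 is hours and the result is in miles per hour
nycflights <- nycflights %>%
  mutate(avg_speed_mph = distance / (air_time / 60))

summary(nycflights$avg_speed_mph)
```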
filter_flights <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
# qplot() is deprecated in recent ggplot2 releases; ggplot() draws the same scatterplot
ggplot(data = filter_flights, aes(x = dep_delay, y = arr_delay, colour = carrier)) +
  geom_point() +
  labs(title = "Scatterplot", x = "Departure Delay", y = "Arrival Delay")