library(tidyverse)
library(openintro)
data(nycflights)
The three histograms differ only in the size of their bins. The bin width sets the range of departure delays that each bar covers, so binwidth = 15 divides the delays into narrower ranges than binwidth = 150. The first histogram sets no bin width and falls back on the default of 30 bins.
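To make the effect of bin width concrete, here is a minimal base R sketch that counts how many bins of a given width are needed to cover a set of delays. The `delays` vector and the `count_bins()` helper are made up for illustration; ggplot2 places its bin edges differently, so this shows only the general idea, not the exact bins drawn in the plots.

```r
# Hypothetical delays in minutes (made-up values, not taken from nycflights)
delays <- c(-5, -2, 0, 3, 8, 14, 22, 37, 61, 95, 140, 290)

# Hypothetical helper: number of equal-width bins needed to cover x
# when the bin edges fall on multiples of `width`
count_bins <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  length(breaks) - 1
}

narrow_bins <- count_bins(delays, 15)   # many narrow bars
wide_bins   <- count_bins(delays, 150)  # a few wide bars

print(c(narrow = narrow_bins, wide = wide_bins))
```

A smaller bin width gives more bars and more detail, while a larger one merges many delays into a handful of bars and can hide the shape of the distribution.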
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
There were 68 flights headed to San Francisco in February.
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
NROW(sfo_feb_flights)
## [1] 68
The arrival delays appear to be right-skewed, judging by the histogram and the summary statistics.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 40)
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 2 x 4
##   origin median_ad iqr_ad n_flights
##   <chr>      <dbl>  <dbl>     <int>
## 1 EWR        -15.5   17.5         8
## 2 JFK        -10.5   22.8        60
DL and UA have the most variable arrival delays: they share the largest IQR, 22 minutes.
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 5 x 4
##   carrier median_ad iqr_ad n_flights
##   <chr>       <dbl>  <dbl>     <int>
## 1 AA            5     17.5        10
## 2 B6          -10.5   12.2         6
## 3 DL          -15     22          19
## 4 UA          -10     22          21
## 5 VX          -22.5   21.2        12
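The IQR used above measures the spread of the middle half of the delays, so a single extreme flight barely moves it. The sketch below contrasts it with the standard deviation on two made-up sets of delays. The carrier names `steady` and `erratic` and all values are hypothetical, not drawn from nycflights.

```r
# Hypothetical arrival delays in minutes for two made-up carriers
steady  <- c(-12, -10, -9, -8, -6, -5, -3, 240)  # tightly clustered, one extreme delay
erratic <- c(-30, -22, -10, -4, 6, 15, 27, 40)   # widely scattered, no extreme delay

spread <- data.frame(
  carrier = c("steady", "erratic"),
  iqr_ad  = c(IQR(steady), IQR(erratic)),  # spread of the middle 50%
  sd_ad   = c(sd(steady), sd(erratic))     # sensitive to the single 240-minute delay
)

print(spread)
```

In this toy data the standard deviation ranks `steady` as more variable because of one long delay, while the IQR ranks `erratic` as more variable, which matches how the typical flights behave.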
The mean gives the overall average departure delay for each month. Its advantage is that every flight contributes to the measure. The median gives the “middle” departure delay for each month. Its advantage is that it discounts outliers, so a few very long delays do not distort the month’s figure.
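A small base R sketch can show the trade-off between the two measures. The delay values below are made up for illustration. Adding one very long delay shifts the mean substantially but moves the median only slightly.

```r
# Hypothetical departure delays in minutes (made-up values)
typical_month <- c(-4, -3, -2, -1, 0, 2, 5)
with_outlier  <- c(typical_month, 300)  # same month plus one very long delay

mean_before   <- mean(typical_month)
mean_after    <- mean(with_outlier)
median_before <- median(typical_month)
median_after  <- median(with_outlier)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

The mean jumps by more than half an hour after one extreme flight is added, while the median changes by half a minute.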
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(mean_dd)
## # A tibble: 12 x 2
##    month mean_dd
##    <int>   <dbl>
##  1    10    5.88
##  2    11    6.10
##  3     9    6.87
##  4     1   10.2
##  5     2   10.7
##  6     8   12.6
##  7     5   13.3
##  8     3   13.5
##  9     4   14.6
## 10    12   17.4
## 11     6   20.4
## 12     7   20.8
nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(median_dd)
## # A tibble: 12 x 2
##    month median_dd
##    <int>     <dbl>
##  1     9        -3
##  2    10        -3
##  3     1        -2
##  4     2        -2
##  5     4        -2
##  6    11        -2
##  7     3        -1
##  8     5        -1
##  9     8        -1
## 10     6         0
## 11     7         0
## 12    12         1
I would choose to fly out of LGA because it has the highest on-time departure rate: about 73% of its flights depart on time, compared with about 69% for JFK and about 64% for EWR.
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## # A tibble: 3 x 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637
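The on-time rate above is a proportion: the number of on-time flights divided by the number of flights. As a minimal sketch with made-up delays, the same rule can be checked by hand in base R. It also shows that taking the mean of the logical condition gives the same number. The `dep_delay` vector here is hypothetical and stands in for one airport's flights.

```r
# Hypothetical departure delays in minutes for one airport (made-up values)
dep_delay <- c(-3, 0, 4, 5, 12, -1, 45, 2)

# Same rule as in the analysis: a delay under 5 minutes counts as on time,
# so a delay of exactly 5 minutes counts as delayed
dep_type <- ifelse(dep_delay < 5, "on time", "delayed")

rate_sum  <- sum(dep_type == "on time") / length(dep_type)  # count / total
rate_mean <- mean(dep_delay < 5)                            # mean of TRUE/FALSE

print(c(rate_sum = rate_sum, rate_mean = rate_mean))
```

Five of the eight made-up flights fall under the 5-minute threshold, so both calculations give the same proportion.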
The formula for average speed: average speed = distance / hours of travel, where hours of travel = minutes of air time / 60.
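As a quick check of the formula, here is a worked example in base R for a single made-up flight. The distance and air time are hypothetical values chosen to keep the arithmetic easy, with distance in miles and air time in minutes.

```r
# Hypothetical flight (made-up values)
distance <- 2400  # miles
air_time <- 300   # minutes in the air

hours     <- air_time / 60     # convert minutes to hours
avg_speed <- distance / hours  # miles per hour

print(avg_speed)
```

Dividing the air time by 60 first keeps the units consistent, so the result is in miles per hour rather than miles per minute.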
nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
There seems to be a positive relationship between average speed and distance: as distance increases, average speed tends to increase as well.
ggplot(data = nycflights, aes(x = distance, y = avg_speed, colour = origin)) +
  geom_point() +
  labs(x = "Distance", y = "Average Speed")
I want a chart limited to flights from three carriers: American Airlines (AA), Delta Air Lines (DL), and United Airlines (UA). I filter the data down to these carriers and draw a scatter plot of departure delay against arrival delay. The cutoff for still arriving at your destination on time appears to be a departure delay of around 30 minutes.
special.air <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
ggplot(special.air, aes(x = arr_delay, y = dep_delay, colour = carrier)) +
  geom_point() +
  xlim(-5, 150) +
  ylim(-5, 150)
## Warning: Removed 8143 rows containing missing values (geom_point).