library(tidyverse)
library(openintro)
The three histograms below show that decreasing the binwidth makes the histogram more sharply defined, which lets us home in on and visualize the mode of the departure delays. A suitable choice of binwidth is therefore necessary.
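To make the effect of the binwidth concrete, here is a minimal sketch on made-up delay values (hypothetical, not taken from `nycflights`). It uses base R's `hist()` with `plot = FALSE` to obtain bin counts only; the bin edges are chosen by hand and need not match the ones `geom_histogram()` picks.

```r
# Hypothetical departure delays in minutes (illustrative values only).
dep_delay <- c(-8, -5, -3, -1, 0, 2, 6, 14, 31, 58, 142)

# Bin counts for two bin widths over the same data, without drawing a plot.
narrow <- hist(dep_delay, breaks = seq(-15, 150, by = 15), plot = FALSE)
wide   <- hist(dep_delay, breaks = seq(-150, 150, by = 150), plot = FALSE)

narrow$counts  # eleven bins: shows where the delays cluster
wide$counts    # two bins: most of that detail is lost
```

With the narrow bins the cluster of delays near zero is visible; with the wide bins the data collapse into two bars.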
# Exercise 1: histograms of departure delays at the default and two chosen bin widths
data(nycflights)
ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 150)
A total of 68 flights flew to SFO in February.
sfo_feb_flights <- nycflights %>% filter(dest == "SFO", month == 2)
The histogram below is roughly triangular around its mode, with observations scattered at both ends of the range. That spread points to a large standard deviation and a large interquartile range, and to several outliers. The mean and the median are the most useful summaries here, bearing in mind that the mean is pulled toward the outliers while the median is not.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) + geom_histogram(binwidth = 5)
sfo_feb_flights %>%
  summarize(mean_ad = mean(arr_delay),
            median_ad = median(arr_delay),
            n = n())
## # A tibble: 1 x 3
##   mean_ad median_ad     n
##     <dbl>     <dbl> <int>
## 1    -4.5       -11    68
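The gap between the mean (-4.5) and the median (-11) above is what a few very late flights would produce. As a minimal sketch on made-up delay values (hypothetical, not taken from `nycflights`), the following shows how a single extreme value moves the mean a long way while the median barely changes.

```r
# Hypothetical arrival delays in minutes (illustrative values only).
delays <- c(-20, -15, -11, -8, -5, 0, 4)
delays_with_outlier <- c(delays, 180)  # add one very late flight

mean(delays)                 # mean without the outlier
mean(delays_with_outlier)    # mean jumps upward
median(delays)               # median without the outlier
median(delays_with_outlier)  # median shifts only slightly
```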
According to the summary below, arrival delays are most variable for the carriers DL and UA, which share the largest interquartile range (22 minutes), and least variable for B6, which has the smallest interquartile range in this subset (12.2 minutes).
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarize(median_ad = median(arr_delay),
            IQR_ad = IQR(arr_delay))
## # A tibble: 5 x 3
##   carrier median_ad IQR_ad
##   <chr>       <dbl>  <dbl>
## 1 AA            5     17.5
## 2 B6          -10.5   12.2
## 3 DL          -15     22
## 4 UA          -10     22
## 5 VX          -22.5   21.2
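For reference, `IQR()` returns the distance between the 75th and 25th percentiles. The sketch below checks this on made-up values for two hypothetical carriers, "X" and "Y" (not from the data set). It uses base R's `tapply()` in place of `group_by()` so that it runs without extra packages.

```r
# Hypothetical arrival delays (minutes) for two made-up carriers "X" and "Y".
toy <- data.frame(
  carrier   = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  arr_delay = c(-10, -5, 0, 5, -30, -10, 10, 30)
)

# IQR per carrier: the spread of the middle 50% of each group.
iqr_by_carrier <- tapply(toy$arr_delay, toy$carrier, IQR)

# The same quantity computed by hand for carrier "X".
x <- toy$arr_delay[toy$carrier == "X"]
iqr_x_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))

iqr_by_carrier
iqr_x_manual
```

The carrier with the wider middle half of delays gets the larger IQR, which is the sense in which DL and UA are called the most variable above.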
Ranking months by mean departure delay is less useful than ranking them by median departure delay. The mean tells us little about how likely a delay is, whereas the median describes the typical flight and therefore the chance of being delayed at all.
If we choose a month by its low median, we will probably not be delayed, but any delay we do run into may be a major one. If we choose by a low mean instead, we should expect more frequent but smaller delays.
nycflights %>%
  group_by(month) %>%
  summarize(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
## # A tibble: 12 x 2
##    month mean_dd
##    <int>   <dbl>
##  1     7   20.8
##  2     6   20.4
##  3    12   17.4
##  4     4   14.6
##  5     3   13.5
##  6     5   13.3
##  7     8   12.6
##  8     2   10.7
##  9     1   10.2
## 10     9    6.87
## 11    11    6.10
## 12    10    5.88
nycflights %>%
  group_by(month) %>%
  summarize(median_dd = median(dep_delay)) %>%
  arrange(desc(median_dd))
## # A tibble: 12 x 2
##    month median_dd
##    <int>     <dbl>
##  1    12         1
##  2     6         0
##  3     7         0
##  4     3        -1
##  5     5        -1
##  6     8        -1
##  7     1        -2
##  8     2        -2
##  9     4        -2
## 10    11        -2
## 11     9        -3
## 12    10        -3
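The trade-off between the two rankings can be sketched with two made-up months, "A" and "B" (hypothetical values, not from `nycflights`): one with frequent small delays and one that is mostly early but has a single major delay. Base R's `tapply()` stands in for `group_by()` and `summarize()` here.

```r
# Hypothetical departure delays (minutes) for two made-up months, "A" and "B".
toy <- data.frame(
  month     = rep(c("A", "B"), each = 5),
  dep_delay = c(5, 6, 7, 8, 9,        # month A: frequent but small delays
                -3, -2, -1, 0, 120)   # month B: mostly early, one major delay
)

mean_dd   <- tapply(toy$dep_delay, toy$month, mean)
median_dd <- tapply(toy$dep_delay, toy$month, median)

# Month with the lowest mean delay versus the lowest median delay.
best_by_mean   <- names(mean_dd)[which.min(mean_dd)]
best_by_median <- names(median_dd)[which.min(median_dd)]
```

In this toy case the mean favors month A and the median favors month B, so the two criteria can disagree when a month has a few extreme delays.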
I would choose to fly out of LGA, which has the highest on-time departure rate of the three airports (72.8%), as calculated below.
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarize(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## # A tibble: 3 x 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637
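The on-time rate is a proportion, so it can also be read as the mean of a logical vector. A minimal sketch on made-up delays (hypothetical values, not from the data set) shows that the two calculations agree, using the same rule as above: a departure less than 5 minutes late counts as on time.

```r
# Hypothetical departure delays in minutes (illustrative values only).
dep_delay <- c(-4, 0, 3, 5, 12, 47, -1, 2)

# Same classification rule as in the report: less than 5 minutes late is "on time".
dep_type <- ifelse(dep_delay < 5, "on time", "delayed")

# Proportion on time, computed two equivalent ways.
rate_from_counts <- sum(dep_type == "on time") / length(dep_type)
rate_from_mean   <- mean(dep_delay < 5)
```

Note that a delay of exactly 5 minutes is classed as delayed, because the comparison is strict.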
nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
ggplot(data = nycflights, aes(x = avg_speed, y = distance)) + geom_point()
The average speed of the airplanes appears to be about 400 mph. When we compare the arrival delay with the departure delay, the graph shows a nearly linear relationship between the two delays, and we can set a cutoff point of roughly 150 minutes: beyond that, the standard deviation of the departure delay becomes very large.
flights <- nycflights %>% filter(carrier %in% c("AA", "DL", "UA"))
ggplot(data = flights, aes(x = dep_delay, y = arr_delay, color = carrier)) + geom_point()
flights %>%
  group_by(arr_delay) %>%
  summarize(median_dd = median(dep_delay),
            sd_dd = sd(dep_delay)) %>%
  arrange(desc(sd_dd))
## # A tibble: 339 x 3
##    arr_delay median_dd sd_dd
##        <dbl>     <dbl> <dbl>
##  1       236     198.  104.
##  2       242     224.   98.3
##  3       104      56    87.7
##  4       124      69    81.7
##  5       194     201    78.6
##  6       164      99    63.3
##  7        96      44.5  62.5
##  8       256     254.   62.5
##  9       255     230.   60.1
## 10       122     110    58.9
## # ... with 329 more rows
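Another way to look for a cutoff, sketched here on made-up data and not the method used above, is to group flights into departure-delay bands and compute the share that still arrive on time in each band. The band edges (0 and 30 minutes) and the definition of on time (`arr_delay <= 0`) are assumptions made for illustration.

```r
# Hypothetical (dep_delay, arr_delay) pairs in minutes (illustrative values only).
toy <- data.frame(
  dep_delay = c(-5, 0, 10, 20, 40, 60, 90, 150),
  arr_delay = c(-20, -12, -3, 4, 25, 50, 85, 140)
)

# Assign each flight to a departure-delay band: (-Inf, 0], (0, 30], (30, Inf).
toy$band <- cut(toy$dep_delay, breaks = c(-Inf, 0, 30, Inf),
                labels = c("<=0", "1-30", ">30"))

# Share of flights arriving on time (arr_delay <= 0) within each band.
on_time_share <- tapply(toy$arr_delay <= 0, toy$band, mean)
on_time_share
```

The band where the on-time share drops toward zero gives a rough idea of how late a flight can leave and still be expected to arrive on time.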