library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
library(ggplot2)
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
**Departure delays:** examining the distribution of departure delays of all flights with a histogram.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
**Changing the apparent shape of the distribution by splitting the data into different bin widths:**
1.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
2.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
**Ans1:** None of the histograms above resembles a normal distribution. All three are right-skewed: the delays pile up near the peak and a long tail extends to its right, so the shape is asymmetrical. In the third histogram the peak frequency is easy to read (above 30,000), whereas in the other two the peak value is harder to read at a glance. A smaller binwidth reveals more detail in the data, so the second histogram (binwidth = 15) shows the most detail. The third histogram (binwidth = 150) has the largest binwidth and hides a lot of detail by clumping the data together. Although the second histogram shows the most detail, the default binning of the first histogram (30 bins) is fine enough to visualize the data.
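The clumping effect described above can be sketched with a few lines of base R. The `delays` vector and the `count_bins()` helper are hypothetical and only illustrate the idea; `geom_histogram()` aligns its bins differently, so the counts will not match a ggplot2 histogram exactly.

```r
# Hypothetical toy delays in minutes (not taken from nycflights)
delays <- c(-5, -3, -2, 0, 1, 4, 8, 12, 20, 35, 60, 140)

# Count observations per bin for a given binwidth.
# Breaks are aligned to multiples of the binwidth and cover the whole range.
count_bins <- function(x, binwidth) {
  lower <- floor(min(x) / binwidth) * binwidth
  upper <- ceiling(max(x) / binwidth) * binwidth
  hist(x, breaks = seq(lower, upper, by = binwidth), plot = FALSE)$counts
}

counts_15 <- count_bins(delays, 15)    # 11 narrow bins: the long right tail stays visible
counts_150 <- count_bins(delays, 150)  # 2 wide bins: the tail is clumped together

counts_15
counts_150
```

With the narrow bins the single 140-minute delay sits in its own bin far to the right; with the wide bins it is merged with every other positive delay.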
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
sfo_feb_flights
## # A tibble: 68 × 16
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 2 18 1527 57 1903 48 DL N711ZX 1322
## 2 2013 2 3 613 14 1008 38 UA N502UA 691
## 3 2013 2 15 955 -5 1313 -28 DL N717TW 1765
## 4 2013 2 18 1928 15 2239 -6 UA N24212 1214
## 5 2013 2 24 1340 2 1644 -21 UA N76269 1111
## 6 2013 2 25 1415 -10 1737 -13 UA N532UA 394
## 7 2013 2 7 1032 1 1352 -10 B6 N627JB 641
## 8 2013 2 15 1805 20 2122 2 AA N335AA 177
## 9 2013 2 13 1056 -4 1412 -13 UA N532UA 642
## 10 2013 2 8 656 -4 1039 -6 DL N710TW 1865
## # … with 58 more rows, 6 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## # variable name ¹arr_delay
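# Each row is one flight, so the number of flights is the number of rows
total_sfo_feb_flights <- nrow(sfo_feb_flights)
total_sfo_feb_flights
## [1] 68
**Ans2:** A total of 68 flights headed to SFO in February (one row per flight in the filtered data).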
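The number of flights is the number of rows in the filtered data, since each row is one flight. The `flight` column holds flight numbers, so adding it up does not give a count. A minimal sketch, assuming a hypothetical three-row data frame whose flight numbers are copied from the first three rows printed above:

```r
# Hypothetical toy data frame: three flights to SFO
toy_flights <- data.frame(
  flight = c(1322, 691, 1765),
  dest = c("SFO", "SFO", "SFO")
)

n_flights <- nrow(toy_flights)                 # 3: one row per flight
flight_number_sum <- sum(toy_flights$flight)   # 3778: a sum of identifiers, not a count

n_flights
flight_number_sum
```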
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth=15)
Ans3: The histogram above is right-skewed. Hence, the median and the IQR
are good choices to describe the distribution: the median is resistant
to the long tail, and the IQR reflects how the middle 50% of the data is
spread around it. Both values are given below by summarizing the data.
**Summary statistics:**
sfo_feb_flights %>%
group_by(origin) %>%
summarise(mean=mean(arr_delay),median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 2 × 5
## origin mean median_ad iqr_ad n_flights
## <chr> <dbl> <dbl> <dbl> <int>
## 1 EWR -15.1 -15.5 17.5 8
## 2 JFK -3.08 -10.5 22.8 60
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 5 × 4
## carrier median_ad iqr_ad n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 5 17.5 10
## 2 B6 -10.5 12.2 6
## 3 DL -15 22 19
## 4 UA -10 22 21
## 5 VX -22.5 21.2 12
**Ans4:** DL and UA have the most variable arrival delays: their interquartile ranges are tied at the highest value, 22 minutes. That is, the middle 50% of their arrival delays span the widest range among the five carriers.
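To make the IQR comparison concrete, the sketch below computes an interquartile range by hand as the distance between the first and third quartiles and checks it against `IQR()`. The `arr_delay` values are hypothetical toy numbers, not one carrier's data.

```r
# Hypothetical toy arrival delays in minutes
arr_delay <- c(-20, -15, -10, -5, 0, 5, 40)

q1 <- quantile(arr_delay, 0.25, names = FALSE)  # first quartile: -12.5
q3 <- quantile(arr_delay, 0.75, names = FALSE)  # third quartile: 2.5
iqr_by_hand <- q3 - q1                          # width of the middle 50%: 15

iqr_by_hand
IQR(arr_delay)  # same value, computed by the built-in function
```

The 40-minute delay lies outside the middle 50% of the values, so it does not change the IQR; this is why the IQR suits skewed delay data.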
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay),median_dd=median(dep_delay)) %>%
arrange(desc(mean_dd))
## # A tibble: 12 × 3
## month mean_dd median_dd
## <int> <dbl> <dbl>
## 1 7 20.8 0
## 2 6 20.4 0
## 3 12 17.4 1
## 4 4 14.6 -2
## 5 3 13.5 -1
## 6 5 13.3 -1
## 7 8 12.6 -1
## 8 2 10.7 -2
## 9 1 10.2 -2
## 10 9 6.87 -3
## 11 11 6.10 -2
## 12 10 5.88 -3
**Ans5:** Pros and cons of the mean and the median:
The mean uses every value in the data set, so it is sensitive to extreme values: a few very high or very low values pull it away from the typical case, and it is best suited to symmetrical distributions. Here, the mean represents the overall average departure delay and accounts for the size of every delay, including the long ones. The median, on the other hand, is insensitive to extreme values and still describes a typical flight when the values are widely dispersed, but it ignores how large the extreme delays are. For example, in July the mean departure delay is 20.8 minutes while the median is 0. The more skewed the distribution, the greater the difference between the median and the mean, and the more emphasis should be placed on the median.
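The sensitivity of the mean to extreme values can be checked on a small example. The vectors below are hypothetical toy delays, chosen only to show the effect of a single long delay.

```r
# Hypothetical toy departure delays in minutes
typical <- c(-3, -2, -1, 0, 1)
with_outlier <- c(typical, 200)  # add one very long delay

mean_typical <- mean(typical)           # -1
median_typical <- median(typical)       # -1
mean_outlier <- mean(with_outlier)      # 32.5: pulled up strongly by the 200
median_outlier <- median(with_outlier)  # -0.5: barely moves

c(mean_typical, median_typical, mean_outlier, median_outlier)
```

One extreme value shifts the mean by more than 30 minutes while the median changes by half a minute, the same pattern seen in the monthly table, where the means are well above the medians.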
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights
## # A tibble: 32,735 × 17
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## 7 2013 12 9 1259 14 1617 22 WN N218WN 1428
## 8 2013 8 13 1920 85 2032 71 B6 N284JB 1407
## 9 2013 9 26 725 -10 1027 -8 AA N3FSAA 2279
## 10 2013 4 30 1323 62 1549 60 EV N12163 4162
## # … with 32,725 more rows, 7 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>,
## # and abbreviated variable name ¹arr_delay
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time")*100 / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 72.8
## 2 JFK 69.4
## 3 EWR 63.7
Ans6: To answer the question, I treat a flight that is delayed by less than 5 minutes as “on time” and any flight delayed by 5 minutes or more as “delayed”. On that basis, I would choose to fly out of LGA (LaGuardia Airport), as it has the highest on-time departure rate (72.8%) of the three airports.
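The on-time rate used here is the percentage of flights whose departure delay is under the 5-minute cutoff. A minimal sketch on hypothetical toy delays, using the same classification rule as the `mutate()` call above:

```r
# Hypothetical toy departure delays in minutes
dep_delay <- c(-3, 0, 4, 5, 12, 30, -1, 2)

# Same rule as in the analysis: under 5 minutes counts as "on time"
dep_type <- ifelse(dep_delay < 5, "on time", "delayed")

# Percentage of on-time departures
ot_dep_rate <- sum(dep_type == "on time") * 100 / length(dep_type)
ot_dep_rate  # 62.5 (5 of the 8 toy flights)
```

Note that a delay of exactly 5 minutes is classified as "delayed" because the comparison is strict.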
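# avg_speed is in miles per hour: distance (miles) / air_time (minutes, converted to hours)
nycflights <- nycflights %>% group_by(carrier) %>% mutate(avg_speed = distance / (air_time / 60))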
nycflights
## # A tibble: 32,735 × 18
## # Groups: carrier [16]
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## 7 2013 12 9 1259 14 1617 22 WN N218WN 1428
## 8 2013 8 13 1920 85 2032 71 B6 N284JB 1407
## 9 2013 9 26 725 -10 1027 -8 AA N3FSAA 2279
## 10 2013 4 30 1323 62 1549 60 EV N12163 4162
## # … with 32,725 more rows, 8 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>,
## # avg_speed <dbl>, and abbreviated variable name ¹arr_delay
plot1 <- ggplot(data = nycflights, aes(x = distance, y = avg_speed)) + geom_point()
plot1
plot2 <- ggplot(data = nycflights, aes(x = distance, y = avg_speed)) + geom_point() + scale_x_log10() + scale_y_log10()
plot2
**Ans8:** From the scatter plots above, average speed increases with
distance. The relationship appears roughly linear on the log-log plot.
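The average speed calculation can be followed by hand on three flights. The `distance` and `air_time` values below are copied from the `glimpse()` output above (rows 1, 2 and 5); treating `distance` as miles and `air_time` as minutes is an assumption based on how the formula converts minutes to hours.

```r
# Values taken from rows 1, 2 and 5 of the glimpse() output
distance <- c(2475, 1598, 296)  # assumed to be miles
air_time <- c(313, 216, 50)     # assumed to be minutes

# Same formula as in the mutate() call: distance divided by hours in the air
avg_speed <- distance / (air_time / 60)
round(avg_speed, 1)  # 474.4 443.9 355.2
```

In these three rows the shortest flight is also the slowest on average, which agrees with the trend in the scatter plots.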
nycflights_for_3_carriers<- nycflights %>%
filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
ggplot(data = nycflights_for_3_carriers, aes(x = dep_delay, y = arr_delay, color= carrier)) + geom_point()
**Ans9:** From the scatter plot above, the cutoff for departure delays for the three carriers is approximately five minutes: if a flight departs no more than about five minutes late, I can reasonably expect to arrive at the destination on time. Some flights still arrive on time after departure delays of up to 55 to 60 minutes, but these cases are uncommon. Most flights that depart late also arrive late. Hence, in most cases I cannot expect to arrive on time if the flight departs late.
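One way to back up a reading of the scatter plot with a number is to compute the share of late-departing flights that still arrive on time. The sketch below uses hypothetical toy delays and assumes "late departure" means a delay of 5 minutes or more and "on-time arrival" means an arrival delay of 0 or less; both cutoffs are assumptions for illustration.

```r
# Hypothetical toy delays in minutes, one pair per flight
dep_delay <- c(10, 20, 60, 6, -2, 0)
arr_delay <- c(5, 15, -3, -10, -12, 3)

late_dep <- dep_delay >= 5  # assumed cutoff for a late departure

# Share of late departures that still arrived on time (arrival delay <= 0)
share_on_time <- mean(arr_delay[late_dep] <= 0)
share_on_time  # 0.5 (2 of the 4 late departures)
```

Applied to the three-carrier data, the same calculation would show how often a late departure is made up in the air.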