Exercise 1:

The histograms show a unimodal, right-skewed distribution. The right skew makes it asymmetric, with high outliers in the upper tail, which suggests that the mean is greater than the median. For this reason it is more appropriate to summarize the data with the median and IQR, because they are less influenced by outliers than the mean and standard deviation. Of the three histograms with different bin widths, the one with a bin width of 15 reveals the most features, because it resolves the data in finer detail.
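As a small illustration of why the median and IQR hold up better under skew, the sketch below compares both pairs of statistics on a made-up vector of delays with one large outlier. The values are hypothetical and are not taken from nycflights.

```r
# Hypothetical delays in minutes; the last value is a large outlier.
delays <- c(-5, -3, -2, -1, 0, 1, 2, 4, 200)

# Mean and standard deviation are pulled upward by the outlier.
mean(delays)
sd(delays)

# Median and IQR depend only on the middle of the data.
median(delays)
IQR(delays)

# Dropping the outlier shifts the mean far more than the median.
no_outlier <- delays[delays < 100]
mean_shift   <- mean(delays) - mean(no_outlier)
median_shift <- median(delays) - median(no_outlier)
c(mean_shift = mean_shift, median_shift = median_shift)
```

In this toy vector the mean moves by more than 20 minutes when the outlier is removed, while the median moves by half a minute.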

##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay"
##  [6] "arr_time"  "arr_delay" "carrier"   "tailnum"   "flight"   
## [11] "origin"    "dest"      "air_time"  "distance"  "hour"     
## [16] "minute"
## Classes 'tbl_df' and 'data.frame':   32735 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  6 5 12 5 7 1 12 8 9 4 ...
##  $ day      : int  30 7 8 14 21 1 9 13 26 30 ...
##  $ dep_time : int  940 1657 859 1841 1102 1817 1259 1920 725 1323 ...
##  $ dep_delay: num  15 -3 -1 -4 -3 -3 14 85 -10 62 ...
##  $ arr_time : int  1216 2104 1238 2122 1230 2008 1617 2032 1027 1549 ...
##  $ arr_delay: num  -4 10 11 -34 -8 3 22 71 -8 60 ...
##  $ carrier  : chr  "VX" "DL" "DL" "DL" ...
##  $ tailnum  : chr  "N626VA" "N3760C" "N712TW" "N914DL" ...
##  $ flight   : int  407 329 422 2391 3652 353 1428 1407 2279 4162 ...
##  $ origin   : chr  "JFK" "JFK" "JFK" "JFK" ...
##  $ dest     : chr  "LAX" "SJU" "LAX" "TPA" ...
##  $ air_time : num  313 216 376 135 50 138 240 48 148 110 ...
##  $ distance : num  2475 1598 2475 1005 296 ...
##  $ hour     : num  9 16 8 18 11 18 12 19 7 13 ...
##  $ minute   : num  40 57 59 41 2 17 59 20 25 23 ...

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Plots: three histograms with different bin widths]

Exercise 3:

The histogram of this data appears mostly unimodal and symmetric, but several large positive outliers make it slightly right-skewed. The mean is likely much higher than the median, since most values cluster around -1 or 0 while outliers reach as high as 200. These outliers pull the mean upward strongly but barely affect the median. The median and IQR are therefore more representative of the data as a whole, because they are less influenced by extreme values.

# A tibble: 5 x 4
  carrier median_dd iqr_dd     n
1 AA            5     17.5    10
2 B6          -10.5   12.2     6
3 DL          -15     22      19
4 UA          -10     22      21
5 VX          -22.5   21.2    12

# A tibble: 12 x 2
   month mean_dd
 1     7   20.8
 2     6   20.4
 3    12   17.4
 4     4   14.6
 5     3   13.5
 6     5   13.3
 7     8   12.6
 8     2   10.7
 9     1   10.2
10     9    6.87
11    11    6.10
12    10    5.88

> nycflights %>%
+   group_by(month)
# A tibble: 32,735 x 16
# Groups:   month [12]
    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
 1  2013     6    30      940        15     1216        -4 VX      N626VA     407
 2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
 3  2013    12     8      859        -1     1238        11 DL      N712TW     422
 4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
 5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
 6  2013     1     1     1817        -3     2008         3 AA      N3AXAA     353
 7  2013    12     9     1259        14     1617        22 WN      N218WN    1428
 8  2013     8    13     1920        85     2032        71 B6      N284JB    1407
 9  2013     9    26      725       -10     1027        -8 AA      N3FSAA    2279
10  2013     4    30     1323        62     1549        60 EV      N12163    4162
# … with 32,725 more rows, and 6 more variables: origin, dest,
#   air_time, distance, hour, minute

> nycflights %>%
+   group_by(month) %>%
+   summarise(median_dd = median(dep_delay)) %>%
+   arrange(desc(median_dd))
# A tibble: 12 x 2
   month median_dd
 1    12         1
 2     6         0
 3     7         0
 4     3        -1
 5     5        -1
 6     8        -1
 7     1        -2
 8     2        -2
 9     4        -2
10    11        -2
11     9        -3
12    10        -3

Exercise 5:

July has the highest mean departure delay from an NYC airport (20.8 minutes), while December has the highest median departure delay (1 minute). Of the two, the median is the more reliable and informative measure of the delay a traveler should typically expect, because it is more robust to the right skew in the data.

# A tibble: 12 x 2
   month median_dd
 1    12         1
 2     6         0
 3     7         0
 4     3        -1
 5     5        -1
 6     8        -1
 7     1        -2
 8     2        -2
 9     4        -2
10    11        -2
11     9        -3
12    10        -3
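The mean and median rankings can be produced side by side in a single summary. This is a minimal sketch using dplyr; `flights_demo` is a hypothetical stand-in with made-up values so that the block runs on its own, and the same pipeline should apply to `nycflights`.

```r
library(dplyr)

# Hypothetical stand-in for nycflights with only the columns needed here.
flights_demo <- tibble(
  month     = c(7, 7, 7, 12, 12, 12),
  dep_delay = c(-2, 0, 120, 1, 2, 3)
)

# Compute both statistics per month in one summarise() call.
by_month <- flights_demo %>%
  group_by(month) %>%
  summarise(
    mean_dd   = mean(dep_delay),
    median_dd = median(dep_delay),
    n         = n()
  )

# Rank months by each statistic; the two orderings can disagree.
by_month %>% arrange(desc(mean_dd))
by_month %>% arrange(desc(median_dd))
```

In the toy data a single 120-minute delay puts month 7 at the top of the mean ranking, while month 12 leads the median ranking, the same kind of disagreement seen between July and December.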

Exercise 8:

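N626VA has the highest average speed among the rows displayed, about 474 mph. The table below still has all 32,735 rows in their original order, so it has not been sorted by avg_speed, and the fastest plane overall may be a different one.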

# A tibble: 32,735 x 2
   avg_speed tailnum
 1      474. N626VA
 2      444. N3760C
 3      395. N712TW
 4      447. N914DL
 5      355. N823AY
 6      319. N3AXAA
 7      353. N218WN
 8      285  N284JB
 9      444. N3FSAA
10      447. N12163
# … with 32,725 more rows
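The table above lists rows in their original order, so finding the fastest plane needs an explicit sort. The sketch below shows one way to do that with dplyr. It assumes distance is in miles and air_time in minutes, which is consistent with the first row (2475 / (313 / 60) ≈ 474). The tail numbers "A1" to "A3" and the third row's values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in: distance in miles, air_time in minutes.
flights_demo <- tibble(
  tailnum  = c("A1", "A2", "A3"),
  distance = c(2475, 1598, 500),
  air_time = c(313, 216, 50)
)

fastest <- flights_demo %>%
  mutate(avg_speed = distance / (air_time / 60)) %>%  # miles per hour
  arrange(desc(avg_speed)) %>%                        # fastest first
  select(avg_speed, tailnum)

fastest
```

With the rows arranged in descending order, the first row of `fastest` is the plane with the highest average speed.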

# A tibble: 2 x 2
  dep_type ot_arr_rate
1 on time        0.891
2 delayed        0


Exercise 11:

The longest departure delay after which a plane still arrived on time was around 60 minutes. However, the average delay at which a plane could leave and still arrive on time was about 20 minutes. With a 20-minute departure delay, you can still expect to arrive on time most of the time.
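One way to check figures like these is to keep only the flights that arrived on time and summarize their departure delays. This is a rough sketch with dplyr. The definition of "on time" as `arr_delay <= 0` is an assumption, and `flights_demo` holds made-up values rather than the real data.

```r
library(dplyr)

# Hypothetical stand-in with departure and arrival delays in minutes.
flights_demo <- tibble(
  dep_delay = c(-3, 15, 20, 62, 85, 5),
  arr_delay = c(-8, -4, 10, -1, 71, 0)
)

# Assumption: a flight is "on time" if it arrives at or before schedule.
flights_demo <- flights_demo %>%
  mutate(arr_type = ifelse(arr_delay <= 0, "on time", "delayed"))

# Departure delays among flights that still arrived on time.
on_time <- flights_demo %>%
  filter(arr_type == "on time") %>%
  summarise(
    max_dd  = max(dep_delay),
    mean_dd = mean(dep_delay),
    n       = n()
  )

on_time
```

Here `max_dd` corresponds to the latest departure that still arrived on time, and `mean_dd` to the average departure delay among on-time arrivals.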


