Lab_02: Introduction to data

Load the packages.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

library(dplyr)
library(ggplot2)

Required data

data(nycflights)

Variable names

names(nycflights)

##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time" 
##  [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"     
## [13] "air_time"  "distance"  "hour"      "minute"

A quick look at the structure and contents of the data

glimpse(nycflights)

## Rows: 32,735
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…

Data Analysis

Departure delays: examining the distribution of departure delays of all flights with a histogram

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Changing the binwidth changes the apparent shape of the distribution, because the data are split between bins differently.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

Exercise 1: Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

Ans1: None of the three histograms shows a symmetric, bell-shaped (normal) distribution. All of them are right-skewed: most flights sit near the peak, and a long tail extends to its right. In the third histogram (binwidth = 150) the peak frequency is easy to read, at above 30,000, whereas in the other two the height of the peak is harder to read at a glance. A smaller binwidth reveals more detail, so the second histogram (binwidth = 15) shows the most detail, while the third, with the largest binwidth, hides a lot of detail by clumping the data together. Although the second histogram shows the most detail, the default binwidth of the first histogram is fine enough to visualize the data.
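
To make the binwidth effect concrete, here is a minimal sketch in base R. The delays are made-up toy values, not the nycflights data, and the break points are chosen only for illustration.

```r
# Hypothetical toy delays in minutes (not the nycflights data): mostly small, a few large.
delays <- c(-5, -3, -2, 0, 1, 2, 4, 8, 15, 40, 90, 200)

# Bin the same values with a narrow and a wide binwidth; plot = FALSE returns the counts only.
fine   <- hist(delays, breaks = seq(-15, 210, by = 15),   plot = FALSE)
coarse <- hist(delays, breaks = seq(-150, 300, by = 150), plot = FALSE)

fine$counts    # many bins: the gaps and the long right tail stay visible
coarse$counts  # three bins: the detail is clumped together
```

With the wide bins the tallest bar is easier to read, but the shape of the tail is lost, which is the same trade-off seen in the three histograms above.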

Exercise 2: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
sfo_feb_flights

## # A tibble: 68 × 16
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2013     2    18     1527        57     1903       48 DL      N711ZX    1322
##  2  2013     2     3      613        14     1008       38 UA      N502UA     691
##  3  2013     2    15      955        -5     1313      -28 DL      N717TW    1765
##  4  2013     2    18     1928        15     2239       -6 UA      N24212    1214
##  5  2013     2    24     1340         2     1644      -21 UA      N76269    1111
##  6  2013     2    25     1415       -10     1737      -13 UA      N532UA     394
##  7  2013     2     7     1032         1     1352      -10 B6      N627JB     641
##  8  2013     2    15     1805        20     2122        2 AA      N335AA     177
##  9  2013     2    13     1056        -4     1412      -13 UA      N532UA     642
## 10  2013     2     8      656        -4     1039       -6 DL      N710TW    1865
## # … with 58 more rows, 6 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## #   variable name ¹arr_delay

total_sfo_feb_flights <- nrow(sfo_feb_flights)
total_sfo_feb_flights

## [1] 68

Ans2: A total of 68 flights headed to SFO in February, one per row of sfo_feb_flights.
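
Counting flights means counting rows, which is different from adding up a column. The sketch below uses a made-up three-row data frame (the name toy_flights is hypothetical) to show that the two operations give different numbers.

```r
# Hypothetical three-row data frame standing in for a filtered flights table.
toy_flights <- data.frame(
  flight = c(1322, 691, 1765),  # flight numbers are labels, not quantities
  dest   = c("SFO", "SFO", "SFO")
)

n_flights  <- nrow(toy_flights)        # number of flights (rows)
number_sum <- sum(toy_flights$flight)  # sum of the labels; this is not a count

n_flights
number_sum
```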

Exercise 3: Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics.

ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth=15)

Ans3: The histogram above is right-skewed. For a skewed distribution, the median and the IQR are good choices of summary statistics: the median is a robust measure of center, and the IQR describes how the middle 50% of the data is spread around it. Both values are given below, by origin airport.

Summary statistics:

sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(mean = mean(arr_delay), median_ad = median(arr_delay),
            iqr_ad = IQR(arr_delay), n_flights = n())

## # A tibble: 2 × 5
##   origin   mean median_ad iqr_ad n_flights
##   <chr>   <dbl>     <dbl>  <dbl>     <int>
## 1 EWR    -15.1      -15.5   17.5         8
## 2 JFK     -3.08     -10.5   22.8        60
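
As a reminder of what the IQR measures, here is a small sketch with made-up arrival delays (not the SFO flights). It assumes R's default quantile method and checks that IQR() is the third quartile minus the first.

```r
# Hypothetical arrival delays in minutes, with one very late flight.
arr_delay <- c(-20, -15, -10, -5, 0, 10, 120)

quartiles  <- quantile(arr_delay, probs = c(0.25, 0.75))  # first and third quartiles
iqr_manual <- unname(quartiles[2] - quartiles[1])          # spread of the middle 50%

median(arr_delay)  # center of the distribution
iqr_manual         # same value as IQR(arr_delay)
IQR(arr_delay)
```

The single 120-minute delay does not affect either statistic, which is why they suit a skewed distribution.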

Exercise 4: Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())

## # A tibble: 5 × 4
##   carrier median_ad iqr_ad n_flights
##   <chr>       <dbl>  <dbl>     <int>
## 1 AA            5     17.5        10
## 2 B6          -10.5   12.2         6
## 3 DL          -15     22          19
## 4 UA          -10     22          21
## 5 VX          -22.5   21.2        12

Ans4: DL and UA have the most variable arrival delays: they tie for the highest interquartile range, 22 minutes. This means that the middle 50% of arrival delays is most spread out for these two carriers.
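
When two groups tie for the largest IQR, it helps to select every group that reaches the maximum instead of only the first. A minimal base R sketch, using made-up delays for three carriers:

```r
# Hypothetical delays for three carriers (not the nycflights data).
toy <- data.frame(
  carrier   = rep(c("AA", "DL", "UA"), each = 4),
  arr_delay = c(0, 2, 4, 6,   0, 10, 20, 30,   -5, 5, 15, 25)
)

# IQR of arr_delay within each carrier.
iqr_by_carrier <- tapply(toy$arr_delay, toy$carrier, IQR)

# Keep every carrier whose IQR equals the maximum, so ties are not dropped.
most_variable <- names(iqr_by_carrier)[iqr_by_carrier == max(iqr_by_carrier)]
most_variable
```

Using which.max() here would report only the first of the tied carriers.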

Exercise 5: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay)) %>%
  arrange(desc(mean_dd))

## # A tibble: 12 × 3
##    month mean_dd median_dd
##    <int>   <dbl>     <dbl>
##  1     7   20.8          0
##  2     6   20.4          0
##  3    12   17.4          1
##  4     4   14.6         -2
##  5     3   13.5         -1
##  6     5   13.3         -1
##  7     8   12.6         -1
##  8     2   10.7         -2
##  9     1   10.2         -2
## 10     9    6.87        -3
## 11    11    6.10        -2
## 12    10    5.88        -3

Ans5: Pros and cons of the mean and the median:

The mean uses every value in the data set, so it reflects the effect of every delay, including the rare very long ones. That also makes it sensitive to extreme values: a few very high or very low values can pull the mean away from what a typical flight experiences, so it is best suited to symmetric distributions. The median, on the other hand, is insensitive to extreme values and describes the typical flight even when the values are widely spread, but it says nothing about how bad the worst delays are. In the table above, for example, July has the highest mean departure delay (20.8 minutes) even though its median is 0. The more skewed the distribution, the greater the difference between the median and the mean, and the more emphasis should be placed on the median. Here both choices point the same way: October has the lowest mean (5.88) and, tied with September, the lowest median (-3).
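
The sensitivity of the mean to a single extreme value can be seen in a small sketch. The numbers are invented for illustration and are not taken from nycflights.

```r
# Hypothetical departure delays in minutes for a quiet month.
delays <- c(-3, -2, -1, 0, 1)

# The same month with a single 300-minute delay added.
delays_with_outlier <- c(delays, 300)

c(mean = mean(delays), median = median(delays))
c(mean = mean(delays_with_outlier), median = median(delays_with_outlier))
```

One extreme flight moves the mean by about 50 minutes, while the median moves by half a minute.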

Exercise 6: If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights

## # A tibble: 32,735 × 17
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2013     6    30      940        15     1216       -4 VX      N626VA     407
##  2  2013     5     7     1657        -3     2104       10 DL      N3760C     329
##  3  2013    12     8      859        -1     1238       11 DL      N712TW     422
##  4  2013     5    14     1841        -4     2122      -34 DL      N914DL    2391
##  5  2013     7    21     1102        -3     1230       -8 9E      N823AY    3652
##  6  2013     1     1     1817        -3     2008        3 AA      N3AXAA     353
##  7  2013    12     9     1259        14     1617       22 WN      N218WN    1428
##  8  2013     8    13     1920        85     2032       71 B6      N284JB    1407
##  9  2013     9    26      725       -10     1027       -8 AA      N3FSAA    2279
## 10  2013     4    30     1323        62     1549       60 EV      N12163    4162
## # … with 32,725 more rows, 7 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>,
## #   and abbreviated variable name ¹arr_delay

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time")*100 / n()) %>%
  arrange(desc(ot_dep_rate))

## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA           72.8
## 2 JFK           69.4
## 3 EWR           63.7

Ans6: To answer the question, I treat a flight that is delayed by less than 5 minutes as “on time” and any flight delayed by 5 minutes or more as “delayed”. On that basis, I would choose LGA (LaGuardia Airport) to fly out of, as it has the highest on-time departure rate (72.8%) of the three airports.
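
An on-time percentage is just the mean of a TRUE/FALSE condition. The sketch below applies the same 5-minute rule to a made-up set of flights in base R; the data frame toy is hypothetical.

```r
# Hypothetical flights (not the nycflights data).
toy <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "JFK", "JFK", "LGA", "LGA", "LGA"),
  dep_delay = c(10, -2, 30, 3, 20, 0, -1, 4)
)

# A logical vector averages to the proportion of TRUE values,
# so mean(dep_delay < 5) is the on-time share under the 5-minute rule.
ot_dep_rate <- tapply(toy$dep_delay < 5, toy$origin, mean) * 100
sort(ot_dep_rate, decreasing = TRUE)
```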

Exercise 7: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph).

nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
nycflights

## # A tibble: 32,735 × 18
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2013     6    30      940        15     1216       -4 VX      N626VA     407
##  2  2013     5     7     1657        -3     2104       10 DL      N3760C     329
##  3  2013    12     8      859        -1     1238       11 DL      N712TW     422
##  4  2013     5    14     1841        -4     2122      -34 DL      N914DL    2391
##  5  2013     7    21     1102        -3     1230       -8 9E      N823AY    3652
##  6  2013     1     1     1817        -3     2008        3 AA      N3AXAA     353
##  7  2013    12     9     1259        14     1617       22 WN      N218WN    1428
##  8  2013     8    13     1920        85     2032       71 B6      N284JB    1407
##  9  2013     9    26      725       -10     1027       -8 AA      N3FSAA    2279
## 10  2013     4    30     1323        62     1549       60 EV      N12163    4162
## # … with 32,725 more rows, 8 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>,
## #   avg_speed <dbl>, and abbreviated variable name ¹arr_delay
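
As a sanity check on the formula, the calculation can be done by hand for one flight. The values below are the distance and air_time of the first row in the glimpse() output near the top of this lab; air_time is in minutes, so it is divided by 60 to get hours.

```r
# First flight in the glimpse() output: 2475 miles flown in 313 minutes.
distance <- 2475  # miles
air_time <- 313   # minutes

# Convert minutes to hours, then divide distance by time to get mph.
avg_speed <- distance / (air_time / 60)
avg_speed
```

The result is a little under 475 mph, a plausible cruising speed for a cross-country flight.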

Exercise 8: Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance.

plot1 <- ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
  geom_point()
plot1

plot2 <- ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
plot2

Ans8: The scatter plots above show that average speed increases as distance increases. The relationship appears to be linear on the log-log plot.

Exercise 9: Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

nycflights_for_3_carriers <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
ggplot(data = nycflights_for_3_carriers,
       aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()

Ans9: From the scatter plot above, the cutoff point for departure delays for the three carriers is approximately five minutes: a flight that departs no more than about five minutes late can reasonably be expected to arrive on time. Some flights still arrive on time after departure delays of up to 55 to 60 minutes, but these cases are not common. The majority of flights that depart late also arrive late, so in most cases I cannot expect to arrive on time if the departure is delayed.
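
A rough numerical companion to reading the cutoff off the plot is to compare the share of on-time arrivals on either side of a candidate cutoff. The sketch below uses made-up delays, and it assumes that "on time" means arr_delay <= 0, which is my own working definition and not something the lab specifies.

```r
# Hypothetical delays in minutes (not the nycflights data).
toy <- data.frame(
  dep_delay = c(-5, 0, 3, 10, 20, 45, 60, 120),
  arr_delay = c(-12, -8, -4, -2, 12, 30, -1, 110)
)

# Share of flights arriving on time (assumed: arr_delay <= 0),
# split at a 5-minute departure delay.
late <- toy$dep_delay > 5
share_on_time_if_late   <- mean(toy$arr_delay[late] <= 0)
share_on_time_if_prompt <- mean(toy$arr_delay[!late] <= 0)

c(late = share_on_time_if_late, prompt = share_on_time_if_prompt)
```

Repeating the comparison for several candidate cutoffs would show where the on-time share starts to drop.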