Lab2-DATA-606-Introduction-to-Data-BKvarnstrom

The data The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.

First, we’ll view the nycflights data frame. Type the following in your console to load the data:

data(nycflights)

To view the names of the variables, type the command

##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time" 
##  [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"     
## [13] "air_time"  "distance"  "hour"      "minute"

The codebook (description of the variables) can be accessed by pulling up the help file:

?nycflights

## starting httpd help server ... done

use glimpse to take a quick peek at your data to understand its contents better.

glimpse(nycflights)

## Rows: 32,735
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…

Departure delays Let’s start by examing the distribution of departure delays of all flights with a histogram.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This function says to plot the dep_delay variable from the nycflights data frame on the x-axis. It also defines a geom (short for geometric object), which describes the type of plot you will produce.

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use:

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

EXERCISE 1 Look carefully at these three histograms.

Question: How do they compare? Answer: All three histograms display the same data and all are right skewed.However the second histogram shows a drastic rise and fall in distribution.

Question: Are features revealed in one that are obscured in another? Answer: All three histograms display the same data however because of the different bin sizes the data is more obscure in the third histogram than the others.

If you want to visualize only on delays of flights headed to Los Angeles, you need to first filter the data for flights with that destination (dest == “LAX”) and then make a histogram of the departure delays of only those flights.

lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can also obtain numerical summaries for these flights:

lax_flights %>%
  summarise(mean_dd   = mean(dep_delay), 
            median_dd = median(dep_delay), 
            n         = n())

## # A tibble: 1 × 3
##   mean_dd median_dd     n
##     <dbl>     <dbl> <int>
## 1    9.78        -1  1583

You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February:

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

EXERCISE 2 Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights.

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

sfo_feb_flights

## # A tibble: 68 × 16
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2013     2    18     1527        57     1903       48 DL      N711ZX    1322
##  2  2013     2     3      613        14     1008       38 UA      N502UA     691
##  3  2013     2    15      955        -5     1313      -28 DL      N717TW    1765
##  4  2013     2    18     1928        15     2239       -6 UA      N24212    1214
##  5  2013     2    24     1340         2     1644      -21 UA      N76269    1111
##  6  2013     2    25     1415       -10     1737      -13 UA      N532UA     394
##  7  2013     2     7     1032         1     1352      -10 B6      N627JB     641
##  8  2013     2    15     1805        20     2122        2 AA      N335AA     177
##  9  2013     2    13     1056        -4     1412      -13 UA      N532UA     642
## 10  2013     2     8      656        -4     1039       -6 DL      N710TW    1865
## # … with 58 more rows, 6 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## #   variable name ¹arr_delay

QUESTION: How many flights meet these criteria? ANSWER: 68 Flights

sfo_feb_flights %>% summarise( n_flights = n())

## # A tibble: 1 × 1
##   n_flights
##       <int>
## 1        68

EXERCISE 3 Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())

## # A tibble: 2 × 4
##   origin median_dd iqr_dd n_flights
##   <chr>      <dbl>  <dbl>     <int>
## 1 EWR          0.5   5.75         8
## 2 JFK         -2.5  15.2         60

ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 15)

EXERCISE 4

Which month would you expect to have the highest average delay departing from an NYC airport? Months 7 & 6

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))

## # A tibble: 12 × 2
##    month mean_dd
##    <int>   <dbl>
##  1     7   20.8 
##  2     6   20.4 
##  3    12   17.4 
##  4     4   14.6 
##  5     3   13.5 
##  6     5   13.3 
##  7     8   12.6 
##  8     2   10.7 
##  9     1   10.2 
## 10     9    6.87
## 11    11    6.10
## 12    10    5.88

QUESTION: Calculate the median and interquartile range for arr_delays of flights in in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

ANSWER: DL and UA have the most variable arrival delays

sfo_feb_flights %>%
  group_by(carrier) %>%
    summarise(median = median(arr_delay), IQR = IQR(arr_delay)) %>%
      arrange(desc(IQR))

## # A tibble: 5 × 3
##   carrier median   IQR
##   <chr>    <dbl> <dbl>
## 1 DL       -15    22  
## 2 UA       -10    22  
## 3 VX       -22.5  21.2
## 4 AA         5    17.5
## 5 B6       -10.5  12.2

EXERCISE 5 QUESTION: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

ANSWER: The mean is the average of the entire Flight data set so in this case the average departure delay will not be as accurate as the median departure delay which is the middle of the data set.

On time departure rate for NYC airports

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637

EXERCISE 6 QUESTION: If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

ANSWER: I would choose to fly out of LGA based on the fact that LGA have the highest on time departure rate.

You can also visualize the distribution of on on time departure rate across the three airports using a segmented bar plot.

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()

EXERCISE 7 Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))

nycflights

## # A tibble: 32,735 × 18
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2013     6    30      940        15     1216       -4 VX      N626VA     407
##  2  2013     5     7     1657        -3     2104       10 DL      N3760C     329
##  3  2013    12     8      859        -1     1238       11 DL      N712TW     422
##  4  2013     5    14     1841        -4     2122      -34 DL      N914DL    2391
##  5  2013     7    21     1102        -3     1230       -8 9E      N823AY    3652
##  6  2013     1     1     1817        -3     2008        3 AA      N3AXAA     353
##  7  2013    12     9     1259        14     1617       22 WN      N218WN    1428
##  8  2013     8    13     1920        85     2032       71 B6      N284JB    1407
##  9  2013     9    26      725       -10     1027       -8 AA      N3FSAA    2279
## 10  2013     4    30     1323        62     1549       60 EV      N12163    4162
## # … with 32,725 more rows, 8 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>,
## #   avg_speed <dbl>, and abbreviated variable name ¹arr_delay

EXERCISE 8 Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

ggplot(data = nycflights, aes(x = distance, y = avg_speed, color= carrier)) + geom_point()+
       labs(title = "AVERAGE SPEED VS DISTANCE", x = "DISTANCE", y = "AVG. SPEED")

EXERCISE 9 QUESTION: Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) What the cutoff point is for departure delays where you can still expect to get to your destination on time.

ANSWER:The cutoff point for departure delays where you can still expect to get to your destination on time seems to be somewhere between 50 and 60 mins.

nycflightsF <- nycflights %>%
  filter(carrier == "AA" | carrier == "DL" | carrier == "UA")

ggplot(data = nycflightsF, aes(x = dep_delay, y = arr_delay, color= carrier)) + geom_point()+
       labs(title = "AVERAGE SPEED VS DISTANCE", x = "DISTANCE", y = "AVG. SPEED")

Lab2-DATA-606-Introduction-to-Data-BKvarnstrom

Beshkia Kvarnstrom

2023-02-09