Lab 2-606

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

library(ggplot2)
head(nycflights)

## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2013     6    30      940        15     1216        -4 VX      N626VA     407
## 2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
## 3  2013    12     8      859        -1     1238        11 DL      N712TW     422
## 4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
## 5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
## 6  2013     1     1     1817        -3     2008         3 AA      N3AXAA     353
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>

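Exercise 1: Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

Yes. All three histograms show the same departure delays, but the bin width changes how the counts read, especially in the low-count parts of the plot. With binwidth = 15 the shape of the distribution is easier to see, while with binwidth = 150 most flights fall into one or two wide bars and that detail is lost. It would be helpful to make two graphs, one for the bulk of the data and one for the low-count tail, so the viewer can read the values more clearly.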

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
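One way to act on the two-graph idea is to split the flights at a cutoff and draw a separate histogram for each part. This is only a sketch: the 60-minute cutoff, the bin widths, and the names cutoff, short_delays, long_delays, p_short, and p_long are assumptions chosen for illustration and are not part of the lab.

```r
library(dplyr)
library(ggplot2)
library(openintro)

# Hypothetical cutoff (in minutes) separating the bulk of the data from the tail
cutoff <- 60

short_delays <- nycflights %>% filter(dep_delay <= cutoff)
long_delays  <- nycflights %>% filter(dep_delay > cutoff)

# Narrow bins for the crowded part, wider bins for the sparse tail
p_short <- ggplot(data = short_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 5)
p_long <- ggplot(data = long_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

p_short
p_long
```

Each plot gets its own y-axis, so the small counts in the tail are no longer flattened by the tall bars near zero.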

Exercise 2: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

68 flights meet these criteria.

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
sfo_feb_flights

## # A tibble: 68 × 16
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2013     2    18     1527        57     1903       48 DL      N711ZX    1322
##  2  2013     2     3      613        14     1008       38 UA      N502UA     691
##  3  2013     2    15      955        -5     1313      -28 DL      N717TW    1765
##  4  2013     2    18     1928        15     2239       -6 UA      N24212    1214
##  5  2013     2    24     1340         2     1644      -21 UA      N76269    1111
##  6  2013     2    25     1415       -10     1737      -13 UA      N532UA     394
##  7  2013     2     7     1032         1     1352      -10 B6      N627JB     641
##  8  2013     2    15     1805        20     2122        2 AA      N335AA     177
##  9  2013     2    13     1056        -4     1412      -13 UA      N532UA     642
## 10  2013     2     8      656        -4     1039       -6 DL      N710TW    1865
## # … with 58 more rows, 6 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## #   variable name ¹arr_delay

nrow(sfo_feb_flights)

## [1] 68

Exercise 3: Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth=10)

print(IQR(sfo_feb_flights$arr_delay))

## [1] 23.25

The distribution is right skewed: the bulk of the flights sit on the left at the lower delay values, and a tail of higher values stretches to the right. It would be helpful to make two graphs to see the data more clearly: one for the right side, and one for the left side, which has the lower values. Because of the skew, the interquartile range (IQR) is a suitable measure of spread; the IQR of a variable is the difference between its upper and lower quartiles. This is my first time using IQR, and the value I get is 23.25 minutes.
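For a skewed distribution, the median and quartiles are usually reported together with the IQR. The sketch below computes them in one call; the column names median_ad, q1_ad, q3_ad, and iqr_ad are made up for this example.

```r
library(dplyr)
library(openintro)

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

# Median and quartiles are not pulled around by the long right tail
arr_delay_summary <- sfo_feb_flights %>%
  summarise(
    median_ad = median(arr_delay),
    q1_ad     = quantile(arr_delay, 0.25),
    q3_ad     = quantile(arr_delay, 0.75),
    iqr_ad    = IQR(arr_delay),   # equals q3_ad - q1_ad
    n_flights = n()
  )
arr_delay_summary
```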

Exercise 4: Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

#median(sfo_feb_flights$arr_delay)

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    median_arrivalDelay = median(arr_delay),
    iqr_arrivalDelay = IQR(arr_delay),
    n_flights = n()
  )

## # A tibble: 5 × 4
##   carrier median_arrivalDelay iqr_arrivalDelay n_flights
##   <chr>                 <dbl>            <dbl>     <int>
## 1 AA                      5               17.5        10
## 2 B6                    -10.5             12.2         6
## 3 DL                    -15               22          19
## 4 UA                    -10               22          21
## 5 VX                    -22.5             21.2        12

Delta Airlines (DL) and United Airlines (UA) have the most variable arrival delays: their IQRs are both 22 minutes, the largest of the five carriers. This means they have the widest spread in arrival delays for the middle 50% of their flights. (Not too sure if this is right!)
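To read the answer straight off the table, the carriers can be sorted by IQR. This is a sketch using the same summary as above; the name carrier_spread is an assumption. Keeping n_flights in view is a reminder that each carrier has only 6 to 21 flights here, so the ranking rests on small groups.

```r
library(dplyr)
library(openintro)

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

# Largest IQR first: the most variable carriers appear at the top
carrier_spread <- sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    iqr_arrivalDelay = IQR(arr_delay),
    n_flights = n()
  ) %>%
  arrange(desc(iqr_arrivalDelay))
carrier_spread
```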

Exercise 5: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

#2, 100, -10, 50, 3
#-10, 2,3, 50, 100   median=3, mean=145/5=29
sfo_feb_flights %>%
  group_by(month) %>%
  summarise(lowMeanDelay = mean(dep_delay))%>%
  arrange(desc(lowMeanDelay))

## # A tibble: 1 × 2
##   month lowMeanDelay
##   <int>        <dbl>
## 1     2         10.5

sfo_feb_flights %>%
  group_by(month) %>%
  summarise(lowMedianDelay = median(dep_delay))%>%
  arrange(desc(lowMedianDelay))

## # A tibble: 1 × 2
##   month lowMedianDelay
##   <int>          <dbl>
## 1     2             -2

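The code above groups sfo_feb_flights, which contains only February flights to SFO, so each summary returns a single row for month 2 (mean 10.5, median -2) and cannot compare months; the same summaries would have to be run on nycflights to rank all twelve. As for the two options, the mean uses every flight but is pulled up by a few very long delays, as in the toy numbers above, where the mean is 29 and the median is 3. The median describes a typical flight and is not affected by outliers, but it hides how bad the worst delays can be. Personally, if I had to choose I would choose the mean, although outliers will shift its value.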
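A sketch of the month comparison on the full nycflights data, next to the toy numbers from the comments in the code above. The names toy_delays, monthly_delays, mean_dep_delay, and median_dep_delay are assumptions made for this example. No output is shown because it was not part of the original run.

```r
library(dplyr)
library(openintro)

# The five toy delays from the comments above
toy_delays <- c(2, 100, -10, 50, 3)
mean(toy_delays)    # 29: pulled up by the two large delays
median(toy_delays)  # 3: the middle value, unaffected by them

# Mean and median departure delay for every month, lowest mean first
monthly_delays <- nycflights %>%
  group_by(month) %>%
  summarise(
    mean_dep_delay   = mean(dep_delay),
    median_dep_delay = median(dep_delay)
  ) %>%
  arrange(mean_dep_delay)
monthly_delays
```

Sorting in ascending order puts the month with the lowest mean delay in the first row; replacing the last step with arrange(median_dep_delay) ranks by the median instead.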

Exercise 6: If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()

I would choose LaGuardia Airport (LGA) because it has the highest on-time departure rate, 72.8%.
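Because the three airports handle different numbers of flights, stacked counts make the rates hard to compare by eye. The sketch below recomputes the rate and rescales each bar to proportions with position = "fill". The names flights_typed, ot_rates, and p_rates are assumptions for this example.

```r
library(dplyr)
library(ggplot2)
library(openintro)

# Same on-time rule as the lab: a departure delay under 5 minutes
flights_typed <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

# The mean of a logical vector is the share of TRUE values
ot_rates <- flights_typed %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = mean(dep_type == "on time")) %>%
  arrange(desc(ot_dep_rate))
ot_rates

# position = "fill" makes every bar the same height, showing proportions
p_rates <- ggplot(data = flights_typed, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  labs(y = "proportion of flights")
p_rates
```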

Exercise 7: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))  # air_time is in minutes; divide by 60 for hours

head(nycflights)

## # A tibble: 6 × 18
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2013     6    30      940        15     1216        -4 VX      N626VA     407
## 2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
## 3  2013    12     8      859        -1     1238        11 DL      N712TW     422
## 4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
## 5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
## 6  2013     1     1     1817        -3     2008         3 AA      N3AXAA     353
## # … with 8 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>, avg_speed <dbl>

Exercise 8: Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
  geom_point()

As distance increases, the average speed of the airplane increases. This makes sense because a shorter flight doesn't climb as high above the clouds as a longer flight, and the longer flight's higher altitude allows it to go faster.

Exercise 9: Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

nycflightsBycarriers <- nycflights %>%
  filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
ggplot(data = nycflightsBycarriers, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()
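The cutoff question can be approached numerically as well as by eye. The sketch below assumes that "on time" means arr_delay <= 0, which the lab does not define, and uses the made-up names big3_flights, on_time_arrivals, max_dep_delay_on_time, and p_cutoff. It finds the largest departure delay that still ended in an on-time arrival and adds a dashed reference line at arr_delay = 0 to the plot. No result is shown because it was not part of the original run.

```r
library(dplyr)
library(ggplot2)
library(openintro)

big3_flights <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))

# Assumption: a flight arrives on time when arr_delay is zero or negative
on_time_arrivals <- big3_flights %>%
  filter(arr_delay <= 0)

# Largest departure delay among flights that still arrived on time
max_dep_delay_on_time <- max(on_time_arrivals$dep_delay)
max_dep_delay_on_time

# Points on or below the dashed line arrived on time
p_cutoff <- ggplot(data = big3_flights,
                   aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
p_cutoff
```

The maximum is an upper bound set by a single flight, so the rough cutoff read from the plot may be lower than this value.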

Sangeetha Sasikumar

9/7/2022