library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
head(nycflights)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
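Exercise 1: Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
Yes. All three histograms show the same departure delays, but the bin width changes how the counts can be read. The default of 30 bins and the 15-minute bins show the shape in more detail, while the 150-minute bins lump most flights into very few bars and hide the detail among the short delays. In every version the low-count bars are hard to read, so it would be helpful to make two graphs, one of them zoomed in on the low-count region, so the viewer can read those values more clearly.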
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
Exercise 2: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
68 flights meet these criteria.
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
sfo_feb_flights
## # A tibble: 68 × 16
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 2 18 1527 57 1903 48 DL N711ZX 1322
## 2 2013 2 3 613 14 1008 38 UA N502UA 691
## 3 2013 2 15 955 -5 1313 -28 DL N717TW 1765
## 4 2013 2 18 1928 15 2239 -6 UA N24212 1214
## 5 2013 2 24 1340 2 1644 -21 UA N76269 1111
## 6 2013 2 25 1415 -10 1737 -13 UA N532UA 394
## 7 2013 2 7 1032 1 1352 -10 B6 N627JB 641
## 8 2013 2 15 1805 20 2122 2 AA N335AA 177
## 9 2013 2 13 1056 -4 1412 -13 UA N532UA 642
## 10 2013 2 8 656 -4 1039 -6 DL N710TW 1865
## # … with 58 more rows, 6 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## # variable name ¹​arr_delay
nrow(sfo_feb_flights)
## [1] 68
Exercise 3: Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth=10)
print(IQR(sfo_feb_flights$arr_delay))
## [1] 23.25
The histogram is right skewed: most flights sit at the low end, and a few large delays form a long tail on the right. It would be helpful to make two graphs to see the data more clearly, one for the lower values on the left and one for the tail on the right. Because the distribution is skewed, the median and the interquartile range (IQR) are more appropriate summaries than the mean and standard deviation. The IQR of a variable is the difference between its upper and lower quartiles. This is my first time using IQR, and the value I got for these arrival delays is 23.25 minutes.
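To illustrate why the median and IQR are the safer summaries for a skewed distribution, here is a minimal sketch on made-up delay values. The vectors `toy_delays` and `toy_delays_tail` and the helper `summarise_delays` are hypothetical and are not taken from nycflights. Adding a single extreme delay moves the mean and standard deviation a lot, while the median and IQR change only slightly.

```r
# Hypothetical arrival delays in minutes (not from nycflights)
toy_delays <- c(-20, -15, -10, -5, 0, 5, 10)

# Same values plus one extreme delay, mimicking a long right tail
toy_delays_tail <- c(toy_delays, 200)

# Hypothetical helper: the four summaries side by side
summarise_delays <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

summarise_delays(toy_delays)
summarise_delays(toy_delays_tail)
```

On the lab data, the same comparison could be made by passing `sfo_feb_flights$arr_delay` to the helper.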
Exercise 4: Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
#median(sfo_feb_flights$arr_delay)
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_arrivalDelay = median(arr_delay), iqr_arrivalDelay = IQR(arr_delay), n_flights = n())
## # A tibble: 5 × 4
## carrier median_arrivalDelay iqr_arrivalDelay n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 5 17.5 10
## 2 B6 -10.5 12.2 6
## 3 DL -15 22 19
## 4 UA -10 22 21
## 5 VX -22.5 21.2 12
Delta Air Lines (DL) and United Airlines (UA) are tied for the most variable arrival delays: both have an IQR of 22 minutes, the largest in the table. This means the middle 50% of their arrival delays spans the widest range. Virgin America (VX) is close behind with an IQR of about 21.2 minutes.
Exercise 5: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
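# Toy example: 2, 100, -10, 50, 3
# Sorted: -10, 2, 3, 50, 100; median = 3, mean = 145 / 5 = 29
# Note: sfo_feb_flights holds only February flights to SFO, so grouping by
# month returns a single row. Comparing months requires the full nycflights data.
sfo_feb_flights %>%
group_by(month) %>%
summarise(lowMeanDelay = mean(dep_delay)) %>%
arrange(desc(lowMeanDelay))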
## # A tibble: 1 × 2
## month lowMeanDelay
## <int> <dbl>
## 1 2 10.5
sfo_feb_flights %>%
group_by(month) %>%
summarise(lowMedianDelay = median(dep_delay))%>%
arrange(desc(lowMedianDelay))
## # A tibble: 1 × 2
## month lowMedianDelay
## <int> <dbl>
## 1 2 -2
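Because this code runs on sfo_feb_flights, which contains only February flights to SFO, it returns a single row (mean departure delay 10.5 minutes, median -2 minutes) and cannot show which month is best; that comparison has to be made on the full nycflights data. As for the two options, the mean uses every flight, so it reflects the chance of a rare but very long delay, but a few extreme delays can inflate it. The median describes the typical flight and is not affected by outliers, but it says nothing about how bad the worst delays are. In the toy example above, the median is 3 while the mean is 29. Personally, if I had to choose, I would choose the mean, keeping in mind that outliers will shift its value.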
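The month-by-month comparison that the exercise asks for could be sketched as follows. The helper `delay_by_month`, the column names `mean_dd` and `median_dd`, and the `toy_flights` data frame are hypothetical names used only for illustration; on the lab data one would pass `nycflights` instead. Both summaries are computed in one call and sorted in ascending order, so the month with the lowest mean delay comes first.

```r
library(dplyr)

# Hypothetical helper: mean and median departure delay per month,
# sorted so the month with the lowest mean delay is listed first.
delay_by_month <- function(flights) {
  flights %>%
    group_by(month) %>%
    summarise(
      mean_dd = mean(dep_delay),
      median_dd = median(dep_delay)
    ) %>%
    arrange(mean_dd)
}

# Made-up data for illustration only (not from nycflights)
toy_flights <- data.frame(
  month = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(-3, -1, 100, 5, 6, 7)
)

delay_by_month(toy_flights)
# On the lab data: delay_by_month(nycflights)
```

In the made-up data, one very long delay makes month 1 look worst by mean even though its typical flight leaves early, which is exactly the trade-off between the two options.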
Exercise 6: If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar()
I would choose LaGuardia Airport (LGA) because it has the highest on-time departure rate, 72.8%, compared with 69.4% for JFK and 63.7% for EWR.
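The on-time rate used above is simply the proportion of flights labelled "on time". A minimal sketch on made-up delays (the `toy_` names are hypothetical, not from nycflights) shows that the `sum(...) / n()` form gives the same value as taking the mean of the logical comparison, and that a delay of exactly 5 minutes counts as delayed under the `< 5` rule.

```r
# Hypothetical departure delays in minutes (not from nycflights)
toy_dep_delay <- c(-2, 0, 4, 5, 30)

# Same rule as in the lab code: under 5 minutes counts as "on time"
toy_dep_type <- ifelse(toy_dep_delay < 5, "on time", "delayed")

# Proportion of on-time flights, computed two equivalent ways
rate_sum <- sum(toy_dep_type == "on time") / length(toy_dep_type)
rate_mean <- mean(toy_dep_type == "on time")

c(rate_sum = rate_sum, rate_mean = rate_mean)
```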
Exercise 7: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
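# air_time is the flight duration in minutes, so divide by 60 to get hours.
# (arr_time is a clock time such as 1216, not a duration.)
nycflights <- nycflights %>%
mutate(avg_speed = distance / (air_time / 60))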
head(nycflights)
## # A tibble: 6 × 18
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## # … with 8 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>, avg_speed <dbl>
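The hint's formula, distance divided by hours in the air, can be checked by hand. The sketch below uses a made-up flight (the `toy_` values are hypothetical, not a row from nycflights). Note that air_time is the column to use here: arr_time is a clock reading such as 1216, not a duration.

```r
# Hypothetical flight: distance in miles, air_time in minutes
toy_distance <- 1000
toy_air_time <- 150

# Convert minutes to hours, then divide distance by hours to get mph
toy_hours <- toy_air_time / 60
toy_avg_speed <- toy_distance / toy_hours

toy_avg_speed
```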
Exercise 8: Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
geom_point()
Average speed tends to increase as distance increases. This makes sense because a short flight spends a larger share of its time climbing and descending, while a longer flight spends most of its time at cruising altitude and cruising speed, which raises its average.
Exercise 9: Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
nycflightsBycarriers <- nycflights %>%
filter(carrier %in% c("AA", "DL", "UA"))
ggplot(data = nycflightsBycarriers, aes(x = dep_delay, y = arr_delay, color = carrier)) +
geom_point()