Part I: Lab Exercise


Lab 1: Find all flights that travel a distance of less than 200 miles. How many such flights are there in the data set?


Code:

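less_200 <- filter(flights, distance < 200)  # strict "<": the question asks for less than 200 miles

Answer: nrow(less_200) gives the number of flights under 200 miles. The inclusive filter distance <= 200 returns 22,977 flights, a count that also includes flights of exactly 200 miles.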
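
Because the question asks for distances strictly below 200 miles, the choice between `<` and `<=` changes the count whenever some flights sit exactly on the boundary. The sketch below illustrates this on a small, hypothetical tibble (the values are made up and are not taken from the flights data).

```r
library(dplyr)

# Hypothetical toy data: two flights sit exactly on the 200-mile boundary.
toy <- tibble(distance = c(80, 116, 200, 200, 213))

n_strict    <- nrow(filter(toy, distance < 200))   # "less than 200 miles"
n_inclusive <- nrow(filter(toy, distance <= 200))  # "200 miles or less"

n_strict     # 2
n_inclusive  # 4
```

On the real data, the same comparison of `nrow()` results shows how many flights are exactly 200 miles long.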



Lab 2:

a) Find all flights that depart from JFK and land at ORD or CVG (Chicago or Cincinnati).


Code:

three_air <- filter(flights, origin == "JFK", dest == "ORD" | dest == "CVG")
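
When the list of destinations grows, `%in%` is a common alternative to chaining `==` tests with `|`. The sketch below, on a hypothetical toy tibble rather than the real flights table, suggests that the two forms select the same rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the flights table.
toy <- tibble(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest   = c("ORD", "CVG", "ORD", "BOS")
)

with_or <- filter(toy, origin == "JFK", dest == "ORD" | dest == "CVG")
with_in <- filter(toy, origin == "JFK", dest %in% c("ORD", "CVG"))

nrow(with_in)                         # 2: only the first two rows qualify
isTRUE(all.equal(with_or, with_in))   # TRUE: both forms keep the same rows
```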


b) Find all flights that flew a distance between 200 and 500 miles.


Code:

two_to_fiveH <- filter(flights, distance >= 200, distance <= 500)  # both bounds must hold; "|" would keep every row
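
A range condition needs both bounds to hold at the same time, so the two tests must be joined with `&` (or passed as separate arguments to `filter()`), not with `|`. The sketch below uses a hypothetical toy tibble to show the difference; `between()` is dplyr's inclusive shorthand for the same test.

```r
library(dplyr)

# Hypothetical toy data with values below, inside, and above the range.
toy <- tibble(distance = c(80, 200, 350, 500, 2475))

# `|` keeps a row if EITHER test passes; every number is >= 200 or <= 500.
either <- filter(toy, distance >= 200 | distance <= 500)

# `&` requires BOTH tests to pass.
both   <- filter(toy, distance >= 200 & distance <= 500)
ranged <- filter(toy, between(distance, 200, 500))  # inclusive shorthand

nrow(either)  # 5: nothing was filtered out
nrow(both)    # 3
nrow(ranged)  # 3
```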


c) Plot the histogram of departure time for filtered data either in a) or b).


This graph focuses on flights traveling distances between 200 and 500 miles. Additionally, it excludes flights departing from JFK and those landing at ORD or CVG (Chicago or Cincinnati).

Code:

# Keep only flights that satisfy every condition in the description above.
cons_flights <- filter(flights,
                       origin != "JFK",               # not departing from JFK
                       !dest %in% c("ORD", "CVG"),    # not landing at ORD or CVG
                       between(distance, 200, 500),   # 200 to 500 miles, inclusive
                       !is.na(dep_time))              # drop flights with no recorded departure time

ggplot(data = cons_flights) +
  geom_histogram(mapping = aes(x = dep_time), fill="orange2", binwidth = 100) +
  labs(title = "Conditional Departure Time",
       x = "Departure time",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.text = element_text(size = rel(1.1))) +
  scale_y_continuous(breaks = seq(0, 6000, 1000))
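
Negated comparisons are easy to misread in filter conditions. In R, `!` has lower precedence than the comparison operators, so `!x >= 200` is evaluated as `!(x >= 200)`, which is simply `x < 200`. The base R sketch below checks this on a hypothetical vector.

```r
# Hypothetical values below, at, and above the threshold.
x <- c(150, 200, 600)

negated_bare   <- !x >= 200    # parsed as !(x >= 200)
negated_parens <- !(x >= 200)
less_than      <- x < 200

identical(negated_bare, negated_parens)  # TRUE
identical(negated_bare, less_than)       # TRUE
```

Writing the positive form (`x < 200`, `origin != "JFK"`) usually makes the intent clearer than a leading `!`.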



Lab 3: How many flights have a missing plane tail number? What is the percentage of flights with a missing plane tail number?


Code:

miss_tail <- filter(flights, is.na(tailnum))

nrow(miss_tail)
## [1] 2512
nrow(miss_tail) / nrow(flights)
## [1] 0.007458964

Answer: There are 2,512 flights with a missing tail number, which is about 0.75% of all 336,776 flights.
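
The count and the percentage can also be computed directly from the logical vector returned by `is.na()`, which avoids typing the numbers by hand. A minimal base R sketch, using a hypothetical vector in place of `flights$tailnum`:

```r
# Hypothetical toy vector standing in for flights$tailnum.
tailnum <- c("N14228", NA, "N619AA", NA, "N804JB", "N668DN", "N39463", "N516JB")

n_missing   <- sum(is.na(tailnum))         # TRUE counts as 1, so this is the number of NAs
pct_missing <- 100 * mean(is.na(tailnum))  # mean of a logical vector is the share of TRUEs

n_missing    # 2
pct_missing  # 25
```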



Lab 4: Using data transformation and data visualization, answer the following: for the airline EV in the flights data set, during which month were the most flights canceled? Submit your code, graph, and answer.


Code:

EV_cancel <- filter(flights, carrier == "EV", is.na(dep_time))  # EV flights with no departure time, treated as canceled

ggplot(data = EV_cancel, aes(x = month)) +
  geom_bar(fill = "orange2") +
  scale_x_continuous(limits = c(0, 13), breaks = seq(1, 12, 1)) +
  scale_y_continuous(limits = c(0, 1500), breaks = seq(0, 1500, 250)) +
  theme_classic() +
  labs(title = "EV's Canceled Flights",
       x = "Months",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.text = element_text(size = rel(1.1)))

Answer: According to the graph, ExpressJet Airlines (EV) experienced the highest number of flight cancellations in February 2013. This reading was taken from a graph that counted canceled flights for all carriers, so it should be confirmed against the EV-only graph produced by the filter above.
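
The month with the most cancellations can also be read from a table instead of a bar chart. The sketch below is one possible way to do this with `count()`; it runs on a hypothetical toy tibble and follows the lab's convention of treating a missing `dep_time` as a cancellation.

```r
library(dplyr)

# Hypothetical toy data: carrier, month, and departure time (NA = canceled).
toy <- tibble(
  carrier  = c("EV", "EV", "EV", "UA", "EV", "UA", "EV"),
  month    = c(1, 2, 2, 2, 2, 3, 3),
  dep_time = c(NA, NA, NA, NA, 900, NA, NA)
)

cancel_by_month <- toy %>%
  filter(carrier == "EV", is.na(dep_time)) %>%  # EV flights that never departed
  count(month, sort = TRUE)                     # largest count first

cancel_by_month$month[1]  # 2: the month with the most EV cancellations in the toy data
```

Note that the UA cancellation in month 2 does not contribute to the EV count, which is why the carrier condition matters.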


Lab 5: Find a way to verify that the flights data do not strictly follow time order and that months 10, 11, and 12 come right after month 1.


Code:

is.unsorted(flights$month)
## [1] TRUE
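
`is.unsorted()` confirms that the column is not in ascending order, but it does not show where the order breaks. As a sketch, `unique()` lists the months in order of first appearance and `diff()` locates the drop. The vector below is hypothetical and only mimics the pattern described in the question; on the real data the same calls would be applied to `flights$month`.

```r
# Hypothetical month column: grouped by month, but not in ascending order.
month <- c(1, 1, 10, 10, 11, 12, 2, 2, 3)

unsorted    <- is.unsorted(month)     # TRUE: some value is smaller than its predecessor
first_seen  <- unique(month)          # 1 10 11 12 2 3: order of first appearance
drop_before <- which(diff(month) < 0) # 6: the value at position 7 is smaller than at position 6

unsorted
first_seen
drop_before
```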

Lab 6:

1. What was the longest travel distance for any flight in our data set? What was the origin and the destination?


Code:

longest_dis <- arrange(flights, desc(distance))
glimpse(longest_dis)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ dep_time       <int> 857, 909, 914, 900, 858, 1019, 1042, 901, 641, 859, 855…
## $ sched_dep_time <int> 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, …
## $ dep_delay      <dbl> -3, 9, 14, 0, -2, 79, 102, 1, 1301, -1, -5, 1, -4, -1, …
## $ arr_time       <int> 1516, 1525, 1504, 1516, 1519, 1558, 1620, 1504, 1242, 1…
## $ sched_arr_time <int> 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1…
## $ arr_delay      <dbl> -14, -5, -26, -14, -11, 28, 50, -26, 1272, -41, -48, -3…
## $ carrier        <chr> "HA", "HA", "HA", "HA", "HA", "HA", "HA", "HA", "HA", "…
## $ flight         <int> 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,…
## $ tailnum        <chr> "N380HA", "N380HA", "N380HA", "N384HA", "N381HA", "N385…
## $ origin         <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK",…
## $ dest           <chr> "HNL", "HNL", "HNL", "HNL", "HNL", "HNL", "HNL", "HNL",…
## $ air_time       <dbl> 659, 638, 616, 639, 635, 611, 612, 645, 640, 633, 613, …
## $ distance       <dbl> 4983, 4983, 4983, 4983, 4983, 4983, 4983, 4983, 4983, 4…
## $ hour           <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ minute         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_hour      <dttm> 2013-01-01 09:00:00, 2013-01-02 09:00:00, 2013-01-03 0…

Answer: The longest distance traveled is 4,983 miles. The origin is John F. Kennedy International Airport (JFK) and the destination is Daniel K. Inouye International Airport (HNL).


2. What was the shortest air time for any flight (that actually finished the trip) in our data set? What was the origin and the destination?


Code:

shortest_time <- arrange(flights, air_time)
glimpse(shortest_time)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 4, 12, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6, 8, 9, 9, 1, 1, …
## $ day            <int> 16, 13, 6, 3, 5, 12, 2, 8, 18, 19, 8, 19, 12, 18, 3, 3,…
## $ dep_time       <int> 1355, 537, 922, 2153, 1303, 2123, 1450, 2026, 1456, 222…
## $ sched_dep_time <int> 1315, 527, 851, 2129, 1315, 2130, 1500, 1935, 1329, 214…
## $ dep_delay      <dbl> 40, 10, 31, 24, -12, -7, -10, 51, 87, 41, 137, 136, 129…
## $ arr_time       <int> 1442, 622, 1021, 2247, 1342, 2211, 1547, 2131, 1533, 23…
## $ sched_arr_time <int> 1411, 628, 954, 2224, 1411, 2225, 1608, 2056, 1426, 224…
## $ arr_delay      <dbl> 31, -6, 27, 23, -29, -14, -21, 35, 67, 19, 109, 115, 10…
## $ carrier        <chr> "EV", "EV", "EV", "EV", "EV", "EV", "US", "9E", "EV", "…
## $ flight         <int> 4368, 4631, 4276, 4619, 4368, 4619, 2132, 3650, 4118, 4…
## $ tailnum        <chr> "N16911", "N12167", "N27200", "N13913", "N13955", "N129…
## $ origin         <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "LGA", "JFK",…
## $ dest           <chr> "BDL", "BDL", "BDL", "PHL", "BDL", "PHL", "BOS", "PHL",…
## $ air_time       <dbl> 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,…
## $ distance       <dbl> 116, 116, 116, 80, 116, 80, 184, 94, 116, 116, 116, 116…
## $ hour           <dbl> 13, 5, 8, 21, 13, 21, 15, 19, 13, 21, 21, 21, 21, 11, 7…
## $ minute         <dbl> 15, 27, 51, 29, 15, 30, 0, 35, 29, 45, 59, 59, 29, 38, …
## $ time_hour      <dttm> 2013-01-16 13:00:00, 2013-04-13 05:00:00, 2013-12-06 0…

Answer: The shortest air time is 20 minutes, and it’s from Newark Liberty International Airport (EWR) to Bradley International Airport (BDL).
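
Instead of sorting the whole table and reading the first column values from `glimpse()`, the extreme rows can be pulled out directly. The sketch below is one possible approach using `slice_max()` and `slice_min()` on a hypothetical toy tibble. Dropping rows with a missing `air_time` first is an assumption about how to express "actually finished the trip".

```r
library(dplyr)

# Hypothetical toy data; NA air_time marks a flight that did not complete.
toy <- tibble(
  origin   = c("EWR", "JFK", "LGA", "EWR"),
  dest     = c("BDL", "HNL", "BOS", "PHL"),
  distance = c(116, 4983, 184, 80),
  air_time = c(20, 659, NA, 21)
)

longest <- toy %>%
  slice_max(distance, n = 1) %>%
  select(origin, dest, distance)

shortest <- toy %>%
  filter(!is.na(air_time)) %>%      # keep only flights with a recorded air time
  slice_min(air_time, n = 1) %>%
  select(origin, dest, air_time)

longest   # JFK to HNL, 4983
shortest  # EWR to BDL, 20
```

By default both functions keep ties, so on the real data they may return several rows that share the same maximum or minimum.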


3. Find a way to arrange your data by month and day in descending order (starting Dec 31st and ending Jan 1st).


Code:

# Order ascending by month, then day, and reverse; month is an integer, so no date conversion is needed.
flights[rev(order(flights$month, flights$day)), ]
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    31       NA            830        NA       NA           1154
##  2  2013    12    31       NA            600        NA       NA            735
##  3  2013    12    31       NA           1615        NA       NA           1800
##  4  2013    12    31       NA            825        NA       NA           1029
##  5  2013    12    31       NA            705        NA       NA            931
##  6  2013    12    31       NA            855        NA       NA           1142
##  7  2013    12    31       NA           1430        NA       NA           1750
##  8  2013    12    31       NA           1500        NA       NA           1817
##  9  2013    12    31       NA           2000        NA       NA           2146
## 10  2013    12    31       NA            754        NA       NA           1118
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
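
The same month-and-day ordering can be expressed with dplyr's `arrange()` and `desc()`, which states both sort keys explicitly. A minimal sketch on a hypothetical toy tibble:

```r
library(dplyr)

# Hypothetical toy data with a few month/day combinations.
toy <- tibble(
  month = c(1, 1, 12, 12, 2),
  day   = c(1, 31, 30, 31, 15)
)

sorted <- arrange(toy, desc(month), desc(day))

sorted$month  # 12 12 2 1 1
sorted$day    # 31 30 15 31 1
```

Rows that share the same month and day keep their original relative order under `arrange()`, so the rows shown first within a day can differ from those produced by reversing a full ordering.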



Part II: Some More Exercises



1. In the mpg data set, which manufacturer produced the most fuel-efficient SUVs?


Code:

class_suv <- filter(mpg, class == "suv")

plot3 <- ggplot(data = class_suv, aes(y = manufacturer, x = hwy)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(aes(fill = manufacturer)) +
  scale_x_continuous(breaks = seq(10, 30, 5), limits = c(10, 30)) +
  labs(title = "SUV Highway Mileage",
       y = "Manufacturer",
       x = "Highway miles per gallon") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(3.5), color = "purple4"),
        plot.margin = margin(1, 2, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(3), color = "orange4"),
        axis.text = element_text(size = rel(2.6)))

plot4 <- ggplot(data = class_suv, aes(y = manufacturer, x = cty)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(aes(fill = manufacturer)) +
  scale_x_continuous(breaks = seq(10, 25, 5), limits = c(10, 25)) +
  labs(title = "SUV City Mileage",
       y = "Manufacturer",
       x = "City miles per gallon") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(3.5), color = "purple4"),
        plot.margin = margin(1, 2, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(3), color = "orange4"),
        axis.text = element_text(size = rel(2.6)))

library(patchwork)
plot3 + plot4
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Answer: According to the two graphs, Subaru outperformed its competitors in both city and highway mileage, indicating that Subaru’s SUVs had the best fuel economy among all manufacturers in the mpg data set.
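
A numeric summary can back up the reading of the boxplots. The sketch below is one possible way to tabulate the median city and highway mileage of SUVs per manufacturer; the column names `median_cty`, `median_hwy`, and `n_rows` are chosen here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

suv_summary <- mpg %>%
  filter(class == "suv") %>%
  group_by(manufacturer) %>%
  summarise(
    median_cty = median(cty),  # median city miles per gallon
    median_hwy = median(hwy),  # median highway miles per gallon
    n_rows     = n()           # number of SUV records for this manufacturer
  ) %>%
  arrange(desc(median_hwy))    # best highway median first

suv_summary
```

Manufacturers with very few SUV records deserve some caution, since a median over two or three rows is not very stable.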


2. In the mpg data set, which SUV manufacturer improved fuel economy most between 1999 and 2008?


Code:

ggplot(data = class_suv, aes(x = cty, y = hwy, color = manufacturer)) + 
  geom_point(show.legend = FALSE) + 
  facet_grid(year ~ manufacturer) +
  scale_x_continuous(breaks = seq(10, 20, 5)) +
  scale_y_continuous(breaks = seq(10, 25, 5)) +
  labs(title = "1999 and 2008 Fuel Economy in SUVs",
       x = "City miles per gallon",
       y = "Highway miles per gallon") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.text = element_text(size = rel(1.1)))

Answer: According to the chart, Subaru showed the most significant improvement in fuel economy between 1999 and 2008. The data points for Subaru in 2008 sit higher on the highway miles-per-gallon axis and further along the city miles-per-gallon axis than in 1999, indicating an increase in both city and highway fuel efficiency. Other manufacturers show relatively minor improvements or remain roughly unchanged.
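
The improvement can also be quantified rather than judged by eye. The sketch below is one possible measure: the change in mean highway and city mileage of each manufacturer's SUVs from 1999 to 2008. Using the mean and the column names `hwy_change` and `cty_change` are choices made here for illustration, not part of the original analysis.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

suv_change <- mpg %>%
  filter(class == "suv") %>%
  group_by(manufacturer) %>%
  summarise(
    # Difference of yearly means; NaN if a manufacturer lacks one of the years.
    hwy_change = mean(hwy[year == 2008]) - mean(hwy[year == 1999]),
    cty_change = mean(cty[year == 2008]) - mean(cty[year == 1999])
  ) %>%
  arrange(desc(hwy_change))  # largest highway improvement first

suv_change
```

A positive value means the 2008 models average more miles per gallon than the 1999 models from the same manufacturer.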


3. In the flights data set, pick a variable other than carrier and analyze whether that variable correlates with long-delay flights.


Code:

long_delay <- filter(flights, dep_delay >= 120 | arr_delay >= 120)
others <- filter(flights, dep_delay < 120 & arr_delay < 120)

plot1 <- ggplot(long_delay) +
  geom_bar(aes(month, y = after_stat(count/sum(count))), fill = "orange2") +
  scale_x_continuous(breaks = seq(1, 12, 1)) +
  labs(title = "Long Delay Flights",
       x = "Month",
       y = "Relative Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.8),color = "purple4", margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4), color = "orange4"), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

plot2 <- ggplot(others) +
  geom_bar(aes(month, y = after_stat(count/sum(count))), fill = "yellow4") +
  scale_x_continuous(breaks = seq(1, 12, 1)) +
  labs(title = "Short or no Delay Flights",
       x = "Month",
       y = "Relative Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.8), color = "purple4", margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4), color = "orange4"), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

library(patchwork)
plot1 + plot2

Answer: According to the chart, there appears to be a correlation between the month and the frequency of long-delay flights. The left graph shows that long-delay flights are more frequent in the summer months, particularly in June and July, while they are less frequent in the fall months (September to November). This suggests that seasonal factors, such as increased air traffic during summer vacation or weather-related disruptions, may contribute to longer delays.

In contrast, the right graph shows that short or no-delay flights are relatively evenly distributed across all months, implying that normal flight operations are less affected by seasonal variations. This further supports the idea that long delays may be influenced by external factors that vary by month, such as peak travel seasons or weather conditions.
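
A complementary check is the share of long-delay flights within each month, which puts both groups on a single scale. The sketch below computes it on a hypothetical toy tibble with the same 120-minute threshold; dropping rows with missing delays before classifying is an assumption made here for simplicity.

```r
library(dplyr)

# Hypothetical toy data: delays in minutes for two months.
toy <- tibble(
  month     = c(6, 6, 6, 6, 10, 10, 10, 10),
  dep_delay = c(130, 5, 200, -3, 10, 0, 125, NA),
  arr_delay = c(140, 0, 190, -10, 5, -2, 110, NA)
)

long_share <- toy %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%        # drop flights without delay data
  mutate(long = dep_delay >= 120 | arr_delay >= 120) %>%  # long delay on departure or arrival
  group_by(month) %>%
  summarise(share_long = mean(long))                      # fraction of long-delay flights

long_share$share_long  # 0.5 for month 6, 1/3 for month 10
```

If the share varies noticeably from month to month on the real data, that would support the seasonal pattern described above.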