Code:
less_200 <- filter(flights, distance <= 200)
Answer: There are 22,977 flights with a travel distance of 200 miles or less.
Code:
three_air <- filter(flights, origin == "JFK", dest == "ORD" | dest == "CVG")
Code:
# Both bounds must hold at once, so combine them with & (| would keep every flight)
two_to_fiveH <- filter(flights, distance >= 200 & distance <= 500)
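As a sanity check on range filters, the sketch below compares `|` with `&` on a small made-up data frame (`toy` is hypothetical and not part of `flights`). Every non-missing distance is either at least 200 or at most 500, so the `|` version keeps every row, while `&` or `dplyr::between()` keeps only the 200 to 500 range.

```r
library(dplyr)

# Hypothetical distances, for illustration only
toy <- data.frame(distance = c(80, 200, 350, 500, 4983))

either <- filter(toy, distance >= 200 | distance <= 500)  # keeps all 5 rows
both   <- filter(toy, distance >= 200 & distance <= 500)  # keeps 200, 350, 500
same   <- filter(toy, between(distance, 200, 500))        # same rows as `both`

nrow(either)
both$distance
same$distance
```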
This graph focuses on flights traveling between 200 and 500 miles.
It also excludes flights departing from JFK and those landing at
ORD or CVG (Chicago or Cincinnati).
Code:
# Keep flights of 200 to 500 miles that do not leave from JFK and do not
# land at ORD or CVG. Cancelled flights (missing dep_time) are dropped so
# the histogram has no non-finite values to remove.
cons_flights <- filter(flights,
                       origin != "JFK",
                       !dest %in% c("ORD", "CVG"),
                       between(distance, 200, 500),
                       !is.na(dep_time))
ggplot(data = cons_flights) +
  geom_histogram(mapping = aes(x = dep_time), fill = "orange2", binwidth = 100) +
  labs(title = "Conditional Departure Time",
       x = "Departure time",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.text = element_text(size = rel(1.1))) +
  scale_y_continuous(breaks = seq(0, 6000, 1000))
Code:
miss_tail <- filter(flights, is.na(tailnum))
2512/336776
## [1] 0.007458964
Answer: There are 2,512 flights without a tail number, which is about 0.75% of all flights.
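The count and the share can also be computed directly instead of typing the numbers by hand. The sketch below uses a short made-up vector (`tailnum` here is hypothetical); on the real data the same idea would be `sum(is.na(flights$tailnum))` and `mean(is.na(flights$tailnum))`.

```r
# Hypothetical tail numbers; NA marks a missing value
tailnum <- c("N380HA", NA, "N16911", NA, "N12167", "N27200", "N13913", "N13955")

n_missing   <- sum(is.na(tailnum))   # number of missing values
pct_missing <- mean(is.na(tailnum))  # share of missing values (TRUE counts as 1)

n_missing
round(100 * pct_missing, 2)
```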
Code:
# Cancelled flights have no departure time; restrict to carrier EV
EV_cancel <- filter(flights, carrier == "EV", is.na(dep_time))
ggplot(data = EV_cancel, aes(x = month)) +
geom_bar(fill = "orange2") +
scale_x_continuous(limits = c(0, 13), breaks = seq(1, 12, 1)) +
scale_y_continuous(limits = c(0, 1500), breaks = seq(0, 1500, 250)) +
theme_classic() +
labs(title = "EV's Canceled Flights",
x = "Months",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
axis.text = element_text(size = rel(1.1)))
Answer: According to the graph, ExpressJet Airlines
(EV) experienced the highest number of flight cancellations in February
2013.
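A table of counts can confirm what the bar chart suggests. The sketch below is a minimal example on made-up rows (`toy` is hypothetical), assuming that a missing `dep_time` marks a cancelled flight; it counts cancelled EV flights per month and sorts the busiest month first.

```r
library(dplyr)

# Hypothetical mini version of flights; NA dep_time means the flight was cancelled
toy <- data.frame(
  carrier  = c("EV", "EV", "EV", "UA", "EV", "UA", "EV"),
  month    = c(1, 2, 2, 2, 3, 3, 2),
  dep_time = c(NA, NA, NA, NA, 900, NA, 1200)
)

ev_by_month <- toy %>%
  filter(carrier == "EV", is.na(dep_time)) %>%  # cancelled EV flights only
  count(month, name = "cancelled") %>%          # one row per month
  arrange(desc(cancelled))                      # month with most cancellations first

ev_by_month
```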
Code:
is.unsorted(flights$month)
## [1] TRUE
Code:
longest_dis <- arrange(flights, desc(distance))
glimpse(longest_dis)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ dep_time <int> 857, 909, 914, 900, 858, 1019, 1042, 901, 641, 859, 855…
## $ sched_dep_time <int> 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, …
## $ dep_delay <dbl> -3, 9, 14, 0, -2, 79, 102, 1, 1301, -1, -5, 1, -4, -1, …
## $ arr_time <int> 1516, 1525, 1504, 1516, 1519, 1558, 1620, 1504, 1242, 1…
## $ sched_arr_time <int> 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1…
## $ arr_delay <dbl> -14, -5, -26, -14, -11, 28, 50, -26, 1272, -41, -48, -3…
## $ carrier <chr> "HA", "HA", "HA", "HA", "HA", "HA", "HA", "HA", "HA", "…
## $ flight <int> 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,…
## $ tailnum <chr> "N380HA", "N380HA", "N380HA", "N384HA", "N381HA", "N385…
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK",…
## $ dest <chr> "HNL", "HNL", "HNL", "HNL", "HNL", "HNL", "HNL", "HNL",…
## $ air_time <dbl> 659, 638, 616, 639, 635, 611, 612, 645, 640, 633, 613, …
## $ distance <dbl> 4983, 4983, 4983, 4983, 4983, 4983, 4983, 4983, 4983, 4…
## $ hour <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ minute <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_hour <dttm> 2013-01-01 09:00:00, 2013-01-02 09:00:00, 2013-01-03 0…
Answer: The longest distance traveled is 4983 miles. The origin is John F. Kennedy International Airport (JFK) and the destination is Daniel K. Inouye International Airport (HNL).
Code:
shortest_time <- arrange(flights, air_time)
glimpse(shortest_time)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 4, 12, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6, 8, 9, 9, 1, 1, …
## $ day <int> 16, 13, 6, 3, 5, 12, 2, 8, 18, 19, 8, 19, 12, 18, 3, 3,…
## $ dep_time <int> 1355, 537, 922, 2153, 1303, 2123, 1450, 2026, 1456, 222…
## $ sched_dep_time <int> 1315, 527, 851, 2129, 1315, 2130, 1500, 1935, 1329, 214…
## $ dep_delay <dbl> 40, 10, 31, 24, -12, -7, -10, 51, 87, 41, 137, 136, 129…
## $ arr_time <int> 1442, 622, 1021, 2247, 1342, 2211, 1547, 2131, 1533, 23…
## $ sched_arr_time <int> 1411, 628, 954, 2224, 1411, 2225, 1608, 2056, 1426, 224…
## $ arr_delay <dbl> 31, -6, 27, 23, -29, -14, -21, 35, 67, 19, 109, 115, 10…
## $ carrier <chr> "EV", "EV", "EV", "EV", "EV", "EV", "US", "9E", "EV", "…
## $ flight <int> 4368, 4631, 4276, 4619, 4368, 4619, 2132, 3650, 4118, 4…
## $ tailnum <chr> "N16911", "N12167", "N27200", "N13913", "N13955", "N129…
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "LGA", "JFK",…
## $ dest <chr> "BDL", "BDL", "BDL", "PHL", "BDL", "PHL", "BOS", "PHL",…
## $ air_time <dbl> 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,…
## $ distance <dbl> 116, 116, 116, 80, 116, 80, 184, 94, 116, 116, 116, 116…
## $ hour <dbl> 13, 5, 8, 21, 13, 21, 15, 19, 13, 21, 21, 21, 21, 11, 7…
## $ minute <dbl> 15, 27, 51, 29, 15, 30, 0, 35, 29, 45, 59, 59, 29, 38, …
## $ time_hour <dttm> 2013-01-16 13:00:00, 2013-04-13 05:00:00, 2013-12-06 0…
Answer: The shortest air time is 20 minutes, and it’s from Newark Liberty International Airport (EWR) to Bradley International Airport (BDL).
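Instead of sorting the whole table and reading the first column of `glimpse()`, the extreme row can be pulled out directly. The sketch below is one possible approach using `slice_max()` and `slice_min()` from dplyr, run on four rows hand-copied from the output above (`toy` is a hypothetical stand-in for `flights`).

```r
library(dplyr)

# Four rows copied by hand from the glimpse() output above, for illustration
toy <- data.frame(
  origin   = c("JFK", "EWR", "LGA", "EWR"),
  dest     = c("HNL", "BDL", "BOS", "PHL"),
  air_time = c(659, 20, 21, 21),
  distance = c(4983, 116, 184, 80)
)

longest <- toy %>%
  slice_max(distance, n = 1) %>%   # row with the largest distance
  select(origin, dest, distance)

shortest <- toy %>%
  slice_min(air_time, n = 1) %>%   # row with the smallest air time
  select(origin, dest, air_time)

longest
shortest
```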
Code:
# Sort by month, latest first; rev() also reverses the row order within each month
flights[rev(order(flights$month)), ]
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 31 NA 830 NA NA 1154
## 2 2013 12 31 NA 600 NA NA 735
## 3 2013 12 31 NA 1615 NA NA 1800
## 4 2013 12 31 NA 825 NA NA 1029
## 5 2013 12 31 NA 705 NA NA 931
## 6 2013 12 31 NA 855 NA NA 1142
## 7 2013 12 31 NA 1430 NA NA 1750
## 8 2013 12 31 NA 1500 NA NA 1817
## 9 2013 12 31 NA 2000 NA NA 2146
## 10 2013 12 31 NA 754 NA NA 1118
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
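The output above starts on December 31 rather than December 1 because `rev(order(...))` reverses the order of tied rows as well. The sketch below shows the difference on four made-up flights (`month` and `day` are hypothetical vectors). With dplyr, `arrange(flights, desc(month))` should behave like the second variant and keep rows within a month in their original order.

```r
# Hypothetical months and days for four flights, stored in calendar order
month <- c(11, 11, 12, 12)
day   <- c(1, 30, 1, 31)

idx_rev  <- rev(order(month))  # descending, but ties come out reversed
idx_desc <- order(-month)      # descending, ties keep their original order

day[idx_rev]   # December 31 comes first
day[idx_desc]  # December 1 comes first
```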
Code:
class_suv <- filter(mpg, class == "suv")
plot3 <- ggplot(data = class_suv, aes(y = manufacturer, x = hwy)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot(aes(fill = manufacturer)) +
scale_x_continuous(breaks = seq(10, 30, 5), limits = c(10, 30)) +
labs(title = "SUV Highway Mileage",
y = "Manufacturer",
x = "Highway miles per gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(3.5), color = "purple4"),
plot.margin = margin(1, 2, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(3), color = "orange4"),
axis.text = element_text(size = rel(2.6)))
plot4 <- ggplot(data = class_suv, aes(y = manufacturer, x = cty)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot(aes(fill = manufacturer)) +
scale_x_continuous(breaks = seq(10, 25, 5), limits = c(10, 25)) +
labs(title = "SUV City Mileage",
y = "Manufacturer",
x = "City miles per gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(3.5), color = "purple4"),
plot.margin = margin(1, 2, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(3), color = "orange4"),
axis.text = element_text(size = rel(2.6)))
library(patchwork)
plot3 + plot4
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Answer: According to the two graphs, Subaru
outperformed its competitors in both city and highway mileage,
indicating that Subaru’s SUV had the best fuel economy among all
manufacturers in the MPG dataset.
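A numeric summary can back up a reading taken from boxplots. The sketch below computes the median city and highway mileage per manufacturer on made-up values (`toy` is hypothetical); on the real data the input would be the SUV subset of `mpg`.

```r
library(dplyr)

# Hypothetical SUV mileage values, for illustration only
toy <- data.frame(
  manufacturer = c("subaru", "subaru", "subaru", "jeep", "jeep", "jeep"),
  cty = c(18, 20, 19, 14, 15, 9),
  hwy = c(25, 27, 26, 19, 20, 12)
)

medians <- toy %>%
  group_by(manufacturer) %>%
  summarise(median_cty = median(cty),
            median_hwy = median(hwy)) %>%
  arrange(desc(median_hwy))   # best highway mileage first

medians
```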
Code:
ggplot(data = class_suv, aes(x = cty, y = hwy, color = manufacturer)) +
geom_point(show.legend = FALSE) +
facet_grid(year ~ manufacturer) +
scale_x_continuous(breaks = seq(10, 20, 5)) +
scale_y_continuous(breaks = seq(10, 25, 5)) +
labs(title = "SUV Fuel Economy, 1999 and 2008",
x = "City miles per gallon",
y = "Highway miles per gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
axis.text = element_text(size = rel(1.1)))
Answer: According to the chart, Subaru showed the most
significant improvement in fuel economy between 1999 and 2008. Subaru's
2008 data points sit higher on the highway miles-per-gallon axis and
further along the city miles-per-gallon axis than its 1999 points,
indicating an increase in both city and highway fuel efficiency. Other
manufacturers show relatively minor improvements or remain consistent
in their fuel efficiency.
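The size of the change between model years can be put into numbers as well. The sketch below, on made-up values (`toy` is hypothetical), compares the mean highway mileage per manufacturer in 1999 and 2008 and ranks manufacturers by the difference.

```r
library(dplyr)

# Hypothetical highway mileage for two manufacturers and two model years
toy <- data.frame(
  manufacturer = c("subaru", "subaru", "subaru", "subaru",
                   "toyota", "toyota", "toyota", "toyota"),
  year = c(1999, 1999, 2008, 2008, 1999, 1999, 2008, 2008),
  hwy  = c(24, 24, 26, 28, 18, 20, 19, 19)
)

change <- toy %>%
  group_by(manufacturer) %>%
  summarise(hwy_1999   = mean(hwy[year == 1999]),
            hwy_2008   = mean(hwy[year == 2008]),
            hwy_change = hwy_2008 - hwy_1999) %>%
  arrange(desc(hwy_change))   # largest improvement first

change
```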
Code:
long_delay <- filter(flights, dep_delay >= 120 | arr_delay >= 120)
others <- filter(flights, dep_delay < 120 & arr_delay < 120)
plot1 <- ggplot(long_delay) +
geom_bar(aes(month, y = after_stat(count/sum(count))), fill = "orange2") +
scale_x_continuous(breaks = seq(1, 12, 1)) +
labs(title = "Long Delay Flights",
x = "Month",
y = "Relative Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.8),color = "purple4", margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
plot2 <- ggplot(others) +
geom_bar(aes(month, y = after_stat(count/sum(count))), fill = "yellow4") +
scale_x_continuous(breaks = seq(1, 12, 1)) +
labs(title = "Short or no Delay Flights",
x = "Month",
y = "Relative Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.8), color = "purple4", margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
library(patchwork)
plot1 + plot2
Answer: According to the chart, there appears to be a
correlation between the month and the frequency of long-delay flights.
The left graph shows that long-delay flights are more frequent in the
summer months, particularly in June and July, while they are less
frequent in the fall months (September to November).
This suggests that seasonal factors, such as increased air traffic
during summer vacation or weather-related disruptions, may contribute to
longer delays.
In contrast, the right graph shows that short or no-delay flights are
relatively evenly distributed across all months, implying that normal
flight operations are less affected by seasonal variations. This further
supports the idea that long delays may be influenced by external factors
that vary by month, such as peak travel seasons or weather
conditions.
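The relative frequencies drawn with `after_stat(count/sum(count))` can also be tabulated. The sketch below is a minimal example on made-up rows (`toy` is hypothetical). It also shows that `filter()` drops rows whose condition evaluates to `NA`, so a flight with both delays missing lands in neither group.

```r
library(dplyr)

# Hypothetical delays in minutes; NA means the delay was not recorded
toy <- data.frame(
  month     = c(6, 6, 7, 9, 6, 7, 9, 9),
  dep_delay = c(130, 10, 150, 5, NA, 125, 0, 20),
  arr_delay = c(140, 5, 100, 0, NA, NA, -5, 121)
)

long_delay <- filter(toy, dep_delay >= 120 | arr_delay >= 120)
others     <- filter(toy, dep_delay < 120 & arr_delay < 120)

# Share of long-delay flights falling in each month
long_by_month <- long_delay %>%
  count(month) %>%
  mutate(rel_freq = n / sum(n))

long_by_month
nrow(long_delay) + nrow(others)  # less than nrow(toy): the all-NA row is in neither group
```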