###Exercise 1
lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 9.78 -1 1583
###Summary exercise 1
The default histogram (30 bins) shows the general shape of the distribution. Setting binwidth = 15 gives more granularity and shows finer details of the delay distribution. Setting binwidth = 150 gives a more aggregated view in which only broad patterns remain. The exercise also summarized the departure delays for LAX flights: a mean of 9.78 minutes, a median of -1 minute, and 1583 flights in total.
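To make the effect of binwidth concrete, here is a minimal base R sketch that counts how many bins a given width produces. The `delays` vector and the `count_bins()` helper are hypothetical and made up for illustration. Base R's `hist()` places bin edges differently from `geom_histogram()`, so the counts will not match the plots exactly.

```r
# Hypothetical departure delays in minutes (not taken from nycflights)
delays <- c(-5, -3, -1, 0, 2, 4, 8, 15, 30, 45, 90, 150, 240)

# Hypothetical helper: number of bins of a given width needed to cover x,
# with bin edges aligned to multiples of the width
count_bins <- function(x, binwidth) {
  lower <- floor(min(x) / binwidth) * binwidth
  upper <- ceiling(max(x) / binwidth) * binwidth
  breaks <- seq(lower, upper, by = binwidth)
  h <- hist(x, breaks = breaks, plot = FALSE)
  length(h$counts)
}

bins_15  <- count_bins(delays, binwidth = 15)   # narrow bins: many bars, fine detail
bins_150 <- count_bins(delays, binwidth = 150)  # wide bins: few bars, broad pattern only
print(c(binwidth_15 = bins_15, binwidth_150 = bins_150))
```

With these toy values the narrow width yields 17 bins and the wide one only 3, which is why the wide histogram hides most of the detail.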
sfo_feb_flights %>%
  summarise(mean_dd = mean(dep_delay, na.rm = TRUE),
            median_dd = median(dep_delay, na.rm = TRUE),
            n = n())
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 10.5 -2 68
###Summary exercise 2
Mean departure delay: 10.5 minutes. Median departure delay: -2 minutes. Total number of flights: 68. The gap between the mean and the median (about 12.5 minutes) is slightly larger than for LAX flights (9.78 versus -1), which indicates right skew or outliers in the data: a few long delays pull the mean up, while the typical flight left slightly early. The number of flights to SFO in February is 68.
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(title = "Histogram of Arrival Delays for SFO Flights in February",
       x = "Arrival Delay (minutes)",
       y = "Number of Flights")
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(
    median_arr_delay = median(arr_delay, na.rm = TRUE),
    iqr_arr_delay = IQR(arr_delay, na.rm = TRUE),
    n_flights = n()
  )
## # A tibble: 2 × 4
## origin median_arr_delay iqr_arr_delay n_flights
## <chr> <dbl> <dbl> <int>
## 1 EWR -15.5 17.5 8
## 2 JFK -10.5 22.8 60
###Summary exercise 3
***EWR (Newark):
Median Arrival Delay: -15.5 minutes (the typical flight arrived about 15 minutes early). Interquartile Range (IQR): 17.5 minutes. Total Flights: 8 flights from EWR.
***JFK (John F. Kennedy):
Median Arrival Delay: -10.5 minutes (the typical flight arrived about 10 minutes early). IQR: 22.8 minutes (a wider spread than EWR). Total Flights: 60 flights from JFK.
***LGA (LaGuardia):
No February flights to SFO departed from LGA in this sample, so LGA does not appear in the output.
*This summary shows that both origins had negative median arrival delays, so the typical flight arrived ahead of schedule. EWR had the lower median and the smaller IQR, but with only 8 flights its figures are less reliable than those for JFK.
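The IQR is the distance between the 75th and 25th percentiles, so it describes the spread of the middle half of the delays. As a minimal sketch, the same per-origin median and IQR can be computed in base R with `tapply()`. The `flights` data frame and its origins "A" and "B" are hypothetical, not taken from nycflights.

```r
# Hypothetical arrival delays (minutes) for two made-up origins
flights <- data.frame(
  origin    = c("A", "A", "A", "A", "B", "B", "B"),
  arr_delay = c(-20, -10, 0, 10, -5, 5, 15)
)

# Median and IQR per origin; na.rm = TRUE skips missing delays
med <- tapply(flights$arr_delay, flights$origin, median, na.rm = TRUE)
iqr <- tapply(flights$arr_delay, flights$origin, IQR, na.rm = TRUE)
n   <- table(flights$origin)

# Collect the results in one table, one row per origin
result <- data.frame(origin = names(med),
                     median_arr_delay = as.vector(med),
                     iqr_arr_delay = as.vector(iqr),
                     n_flights = as.vector(n))
print(result)
```

Reporting `n_flights` next to the statistics matters: a median or IQR based on a handful of flights can change a lot if a single flight is added or removed.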
###Exercise 4
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    median_arr_delay = median(arr_delay, na.rm = TRUE),
    iqr_arr_delay = IQR(arr_delay, na.rm = TRUE),
    n_flights = n()
  ) %>%
  arrange(desc(iqr_arr_delay))
## # A tibble: 5 × 4
## carrier median_arr_delay iqr_arr_delay n_flights
## <chr> <dbl> <dbl> <int>
## 1 DL -15 22 19
## 2 UA -10 22 21
## 3 VX -22.5 21.2 12
## 4 AA 5 17.5 10
## 5 B6 -10.5 12.2 6
###Summary exercise 4
**Summary
*Delta Airlines (DL) and United Airlines (UA) tie for the most variable arrival delays, each with a reported IQR of 22 minutes, so the middle half of their arrival delays spans the widest range. Virgin America (VX) follows closely with an IQR of 21.2 minutes.
*American Airlines (AA) is less variable, with an IQR of 17.5 minutes.
*JetBlue (B6) has the least variable delays with an IQR of 12.2 minutes, although this is based on only 6 flights.
***Median Delay Comparison:
*AA has the highest median arrival delay at 5 minutes and is the only carrier whose median flight arrived late.
*VX has the lowest median arrival delay at -22.5 minutes. DL (-15), B6 (-10.5), and UA (-10) also have negative medians, meaning at least half of their flights arrived early.
*Each carrier has between 6 and 21 flights, so these comparisons rest on small samples.
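When picking the "most variable" carrier from a table like the one in Exercise 4, ties are easy to miss. The sketch below copies the rounded IQR values printed in that table into a named vector and keeps every carrier that reaches the maximum. The variable names are only illustrative.

```r
# IQR of arrival delay per carrier, copied from the rounded table output in Exercise 4
iqr_arr_delay <- c(DL = 22, UA = 22, VX = 21.2, AA = 17.5, B6 = 12.2)

# which.max() would return only the first maximum,
# so compare against max() to keep all tied carriers
most_variable <- names(iqr_arr_delay)[iqr_arr_delay == max(iqr_arr_delay)]
print(most_variable)

# Carrier with the smallest IQR (least variable)
least_variable <- names(which.min(iqr_arr_delay))
print(least_variable)
```

Because the table shows rounded values, the unrounded IQRs for DL and UA could still differ slightly.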
###Exercise 5
# Step 1: Create a new variable 'dep_type' to classify flights as 'on time' or 'delayed'
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
# Steps 2 and 3: Group by origin and calculate the on-time departure rate
nycflights %>%
  group_by(origin) %>%
  summarise(
    ot_dep_rate = sum(dep_type == "on time") / n()  # Proportion of on-time flights
  ) %>%
  arrange(desc(ot_dep_rate))  # Step 4: Arrange in descending order of on-time departure rate
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
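The on-time rate above is a proportion: the number of "on time" flights divided by the number of flights. A minimal base R sketch with hypothetical delays shows the classification rule and two equivalent ways to compute the proportion. The `NA` entry is an assumption added to show what would happen if a delay were missing.

```r
# Hypothetical departure delays in minutes (not taken from nycflights)
dep_delay <- c(-3, 0, 4, 5, 12, 60, NA)

# Same rule as in Exercise 5: under 5 minutes late counts as "on time",
# so a delay of exactly 5 minutes is classified as "delayed"
dep_type <- ifelse(dep_delay < 5, "on time", "delayed")

# sum(...) / count and mean(...) give the same proportion;
# flights with a missing delay are excluded from both
known <- !is.na(dep_type)
rate_sum  <- sum(dep_type[known] == "on time") / sum(known)
rate_mean <- mean(dep_type == "on time", na.rm = TRUE)
print(c(rate_sum = rate_sum, rate_mean = rate_mean))
```

Here 3 of the 6 flights with a known delay are on time, so both calculations give 0.5.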
###Summary exercise 5
**Option 1: Using the Month with the Lowest Mean Departure Delay
*Pros:
Takes all delays into account: The mean considers both small and large delays, providing a comprehensive measure of the overall departure delay situation.
Reflects long-term performance: If a month consistently has fewer delays, the mean will reflect that. It is good for understanding the overall performance for most flights.
Useful in data with symmetric distributions: When delays are evenly spread and there are no extreme outliers, the mean accurately represents the expected delay.
*Cons:
Sensitive to outliers: The mean can be skewed by extreme values (e.g., one or two flights with very long delays). A few significant delays can inflate the mean, making the month appear worse than it is for most flights.
Might not reflect typical experience: If most flights are on time but a few are very delayed, the mean could suggest a higher average delay, even though most passengers would experience short or no delays.
**Option 2: Using the Month with the Lowest Median Departure Delay
*Pros:
Not influenced by outliers: The median is largely unaffected by extremely delayed flights, making it a better measure of the typical experience. Even if there are a few very long delays, the median remains stable.
Represents the middle experience: The median tells you what a typical traveler is likely to experience. If the median delay is low, at least half of the flights had little or no delay.
Better in skewed data: If delays are heavily skewed (e.g., most flights are on time but a few have significant delays), the median provides a more reliable indicator of what passengers can expect.
*Cons:
Ignores extreme values: The median does not account for the scale of extreme delays. If a few flights have very long delays, the median will not reflect that, which could be a problem if those rare, long delays affect many people.
Less comprehensive: The median only tells you about the middle value. It does not give you a sense of the overall range of delays or how often extreme delays occur.
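The trade-off between the two options can be seen in a small base R sketch. The delay values are hypothetical. A single very late flight is appended to an otherwise punctual set, and the mean and median are compared before and after.

```r
# Hypothetical departure delays in minutes for a mostly punctual month
typical <- c(-5, -3, -2, -2, -1, 0, 2)

# The same flights plus one flight that left two hours late
with_outlier <- c(typical, 120)

mean_before   <- mean(typical)
median_before <- median(typical)
mean_after    <- mean(with_outlier)
median_after  <- median(with_outlier)

# The mean jumps by about 15 minutes; the median moves by only half a minute
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

The single outlier moves the mean from about -1.6 to 13.6 minutes, while the median only shifts from -2 to -1.5. This is the behavior the pros and cons describe: the mean registers the extreme delay, and the median reflects what most flights experienced.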
###Exercise 6
# Assuming the nycflights dataset is already prepared with the 'dep_type' variable from Exercise 5
# Create a segmented bar plot of on-time vs delayed flights by airport (origin)
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar() +
  labs(title = "On-Time vs Delayed Departures by NYC Airport",
       x = "Airport",
       y = "Number of Flights")
###Summary exercise 6
***Based on the on-time departure rates calculated in Exercise 5, you would choose to fly out of the airport with the highest on-time departure rate. The output shows that LGA has the highest rate (0.728, about 73%), followed by JFK (0.694) and EWR (0.637), so LGA is the best airport to fly out of to minimize the chance of a departure delay.
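A stacked bar plot of counts makes rates hard to compare when airports handle different numbers of flights, so it can help to convert the counts to within-airport proportions. A minimal base R sketch, using a hypothetical set of ten flights rather than nycflights:

```r
# Hypothetical flights: origin airport and departure type
origin   <- c(rep("EWR", 4), rep("JFK", 2), rep("LGA", 4))
dep_type <- c("on time", "on time", "delayed", "delayed",
              "on time", "delayed",
              "on time", "on time", "on time", "delayed")

# Counts per airport and departure type (what a stacked bar plot of counts shows)
counts <- table(origin, dep_type)

# Row-wise proportions: each airport's row sums to 1
props <- prop.table(counts, margin = 1)
print(props)

# Airport with the highest on-time proportion
best_airport <- names(which.max(props[, "on time"]))
print(best_airport)
```

In ggplot2, `geom_bar(position = "fill")` draws these proportions directly, so every bar has the same height and the on-time shares can be compared at a glance.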
###Exercise 7
## tibble [32,735 × 17] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:32735] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:32735] 6 5 12 5 7 1 12 8 9 4 ...
## $ day : int [1:32735] 30 7 8 14 21 1 9 13 26 30 ...
## $ dep_time : int [1:32735] 940 1657 859 1841 1102 1817 1259 1920 725 1323 ...
## $ dep_delay: num [1:32735] 15 -3 -1 -4 -3 -3 14 85 -10 62 ...
## $ arr_time : int [1:32735] 1216 2104 1238 2122 1230 2008 1617 2032 1027 1549 ...
## $ arr_delay: num [1:32735] -4 10 11 -34 -8 3 22 71 -8 60 ...
## $ carrier : chr [1:32735] "VX" "DL" "DL" "DL" ...
## $ tailnum : chr [1:32735] "N626VA" "N3760C" "N712TW" "N914DL" ...
## $ flight : int [1:32735] 407 329 422 2391 3652 353 1428 1407 2279 4162 ...
## $ origin : chr [1:32735] "JFK" "JFK" "JFK" "JFK" ...
## $ dest : chr [1:32735] "LAX" "SJU" "LAX" "TPA" ...
## $ air_time : num [1:32735] 313 216 376 135 50 138 240 48 148 110 ...
## $ distance : num [1:32735] 2475 1598 2475 1005 296 ...
## $ hour : num [1:32735] 9 16 8 18 11 18 12 19 7 13 ...
## $ minute : num [1:32735] 40 57 59 41 2 17 59 20 25 23 ...
## $ dep_type : chr [1:32735] "delayed" "on time" "on time" "on time" ...
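As a quick unit check for the average-speed calculation in this exercise, the first row of the structure output above (distance 2475 miles, air_time 313 minutes) can be worked through by hand in base R. Only those two values come from the output; the variable names mirror the columns.

```r
distance <- 2475   # miles, first value of the distance column above
air_time <- 313    # minutes, first value of the air_time column above

# Convert minutes to hours, then divide miles by hours to get mph
air_time_hours <- air_time / 60
avg_speed <- distance / air_time_hours

print(round(avg_speed, 1))  # about 474.4 mph
```

Dividing by `air_time` directly, without the conversion, would give miles per minute (about 7.9) rather than miles per hour.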
# Compute average speed from the existing distance and air_time columns
nycflights <- nycflights %>%
  mutate(
    air_time_hours = air_time / 60,        # Convert air_time from minutes to hours
    avg_speed = distance / air_time_hours  # Calculate average speed in mph
  )
###Exercise 8
# Scatterplot of avg_speed vs distance
ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
  geom_point() +
  labs(title = "Scatterplot of Average Speed vs Distance",
       x = "Distance (miles)",
       y = "Average Speed (mph)") +
  theme_minimal()
###Exercise 9
# Filter the dataset for flights from AA, DL, and UA
filtered_flights <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
# Create the scatterplot with departure delay on the x-axis and arrival delay on the y-axis
ggplot(data = filtered_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point(alpha = 0.6) +  # Use alpha to make overlapping points more visible
  labs(title = "Scatterplot of Departure Delay vs Arrival Delay",
       x = "Departure Delay (minutes)",
       y = "Arrival Delay (minutes)",
       color = "Carrier") +
  theme_minimal() +
  scale_color_manual(values = c("AA" = "blue", "DL" = "red", "UA" = "green"))