A CSV file is created by arranging the provided data in a table format, with rows representing airlines, their statuses (On Time or Delayed), and arrival delays at different destinations. The resulting CSV file, named “DATA_607_Wk5.csv,” is generated and saved for further analysis
Functions used:
r bind: the rbind function is utilized to row-bind multiple vectors containing airline-related data, such as airline names, statuses, and arrival delays, into a single matrix named csv, representing tabular information.
write.table: he write.table function is used to write data to a CSV file, specifying the data, file name, separator character, and options for excluding column and row names.
# Create a CSV file by inputting the data
csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
c(NA, "Delayed", 62, 12, 20, 102, 305),
c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
c(NA, "Delayed", 117, 415, 65, 129, 61))
write.table(csv, file = "DATA_607_Wk5.csv", sep = ",", col.names=F, row.names=F)
# Read the CSV file into R using link
flights_data <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data-607/main/DATA_607_Wk5.csv", stringsAsFactors = FALSE)
# Display
flights_data
## NA. NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 AM WEST On Time 694 4840 383 320 201
## 4 <NA> Delayed 117 415 65 129 61
The previously created flights_data is transformed into a “long format” using the tidyr and dplyr packages. The data is reshaped to have separate rows for each airline, their statuses, and arrival delays at different destinations, facilitating further analysis and comparisons between the airlines’ performance.
Functions used: mutate: is used to create new columns or modify existing columns in a dataframe. ‘Mutate(Airlines1 = NA., Airlines2=lag(NA.))’: It creates two new columns, Airlines1 and Airlines2, with missing values (NA) and sets Airlines2 to be the lag (previous) value of Airlines1. ‘mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1)’: It creates two more new columns, Airline and Status. Airline is populated with non-missing values from Airlines1 and Airlines2, ensuring that each row now has an airline name. Status is set to the value of the second column in the original data (the “Status” column), which indicates whether the flight was on time or delayed.
gather: gather is used to convert wide data (data with multiple columns) into long data (data with fewer columns but more rows).
select: is used to choose specific columns from a dataframe.
arrange: is used to reorder rows in a dataframe based on specific column values.
# Import packages as necessary
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Convert to long format
long_data <- flights_data %>%
mutate(Airlines1 = NA., Airlines2=lag(NA.)) %>%
mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1) %>%
gather("Destination", "Flights", 3:7) %>%
select(Airline:Flights) %>%
arrange(Airline, desc(Status), Destination); long_data
## Airline Status Destination Flights
## 1 ALASKA On Time Los.Angeles 497
## 2 ALASKA On Time Phoenix 221
## 3 ALASKA On Time San.Diego 212
## 4 ALASKA On Time San.Francisco 503
## 5 ALASKA On Time Seattle 1841
## 6 ALASKA Delayed Los.Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San.Diego 20
## 9 ALASKA Delayed San.Francisco 102
## 10 ALASKA Delayed Seattle 305
## 11 AM WEST On Time Los.Angeles 694
## 12 AM WEST On Time Phoenix 4840
## 13 AM WEST On Time San.Diego 383
## 14 AM WEST On Time San.Francisco 320
## 15 AM WEST On Time Seattle 201
## 16 AM WEST Delayed Los.Angeles 117
## 17 AM WEST Delayed Phoenix 415
## 18 AM WEST Delayed San.Diego 65
## 19 AM WEST Delayed San.Francisco 129
## 20 AM WEST Delayed Seattle 61
To conduct analysis, the mean arrival delays for the two airlines (ALASKA and AM WEST) at various destinations are calculated and visualized to provide insights into their punctuality performance. The mean delay values are presented in a tabular format, and a bar chart is generated to visually compare the arrival delays for each airline at different destinations, facilitating a better understanding of the data set.
Functions: group by function: is used to group rows of a dataframe based on one or more columns. In this case, it’s grouping the data by two columns, “Airline” and “Destination.”
summarize: is used to calculate summary statistics within each group created by ‘group_by’. g gplot function (from the ggplot2 package): ggplot is used for creating data visualizations, such as plots and charts.
library(dplyr)
# Calculate mean delays for each airline and destination
mean_delays <- long_data %>%
group_by(Airline, Destination) %>%
summarize(Mean_Delay = mean(Flights))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
# Print the mean_delays data frame
mean_delays
## # A tibble: 10 × 3
## # Groups: Airline [2]
## Airline Destination Mean_Delay
## <chr> <chr> <dbl>
## 1 ALASKA Los.Angeles 280.
## 2 ALASKA Phoenix 116.
## 3 ALASKA San.Diego 116
## 4 ALASKA San.Francisco 302.
## 5 ALASKA Seattle 1073
## 6 AM WEST Los.Angeles 406.
## 7 AM WEST Phoenix 2628.
## 8 AM WEST San.Diego 224
## 9 AM WEST San.Francisco 224.
## 10 AM WEST Seattle 131
# Create a bar chart to visualize mean delays of destination and airline
library(ggplot2)
ggplot(mean_delays, aes(x = Destination, y = Mean_Delay, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Mean Arrival Delays by Destination and Airline",
x = "Destination",
y = "Mean Delay",
fill = "Airline"
) +
theme_minimal()
Conclusion: Studying the outputs of the mean, and the visual graph, it can be seen that: ALASKA Airlines generally has lower mean arrival delays compared to AM WEST at most destinations, except for San Francisco, where it experiences a significantly higher mean delay of 302.5 minutes, indicating potential punctuality issues.
AM WEST Airlines generally experiences significantly higher mean arrival delays, notably at Phoenix and San Francisco, with delays of 2627.5 minutes and 224.5 minutes, respectively, although it performs relatively better at Los Angeles, San Diego, and Seattle, still encountering delays.