For this assignment, we are given a dataset from Numbersense, Kaiser Fung McGraw Hill, 2013 and asked to: 1. Create a CSV in wide format 2. Read the CSV into R and tidy and transform the data using tidyr and dplyr 3. Perform analysis to compare the arrival delays for the two airlines
For this assignment, I deiced to create the CSV file right into R, by creating a dataframe for it and then writing it out to a CSV file:
flight_status <- data.frame(
Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
Flight.Status = c("on time", "delayed", "on time", "delayed"),
Los.Angeles = c(497, 62, 694, 117),
Phoenix = c(221, 12, 4840, 415),
San.Diego = c(212, 20, 383, 65),
San.Francisco = c(503, 102, 320, 129),
Seattle = c(1841, 305, 201, 61)
)
flight_status
## Airline Flight.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Now, we need to write the data frame to a CSV file:
write.csv(flight_status, "flight_status.csv", row.names = FALSE)
Now to make sure the file exists, I can use the file.exists() function:
if (file.exists("flight_status.csv")) {
print("flight_status.csv exists")
} else {
print("flight_status.csv does NOT exist")
}
## [1] "flight_status.csv exists"
Now with completing the creation of the CSV file, we can now move onto step 2, which is to tidy and transform the data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
flights_wide <- read.csv("flight_status.csv", stringsAsFactors = FALSE)
head(flights_wide)
## Airline Flight.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
With this data being in wide structure, for the tidying portion, it is important to convert the data from a wide to a long format because by allowing each variable to form a column and each observation to form a row, it allows us to work with the data easier
flights_long <- flights_wide %>%
pivot_longer(
cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "Destination",
values_to = "Count"
)
head(flights_long)
## # A tibble: 6 × 4
## Airline Flight.Status Destination Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
After transforming the data to long form, each row how represents a unique combination of Airline, Flight.Status. and Destination
With doing this, we can now move onto step 3, which is performing simple analysis
For the analysis portion, I will be comparing:
flight_sum <- flights_long %>%
group_by(Airline, Flight.Status) %>%
summarize(TotalFlights = sum(Count), .groups = "drop")
flight_sum
## # A tibble: 4 × 3
## Airline Flight.Status TotalFlights
## <chr> <chr> <int>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
ggplot(flight_sum, aes(x = Airline, y = TotalFlights, fill = Flight.Status)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Total Flights by Airline and Flight Status",
x = "Airline",
y = "Number of Flights") +
theme_minimal()
By doing simple analysis to compare the two Airline’s flight status’ we are able to see that both airlines have much more on time flight arrivals compared to their delayed flight arrivals
Now moving onto the next analysis which is to compare the percentage of delayed flights by destination for each airline
percentage_delayed <- flights_long %>%
group_by(Airline, Destination, Flight.Status) %>%
summarize(Total = sum(Count), .groups = "drop") %>%
pivot_wider(names_from = Flight.Status, values_from = Total) %>%
mutate(
TotalFlights = `on time` + delayed,
PercentDelayed = round(delayed / TotalFlights * 100, 2)
)
percentage_delayed
## # A tibble: 10 × 6
## Airline Destination delayed `on time` TotalFlights PercentDelayed
## <chr> <chr> <int> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 62 497 559 11.1
## 2 ALASKA Phoenix 12 221 233 5.15
## 3 ALASKA San.Diego 20 212 232 8.62
## 4 ALASKA San.Francisco 102 503 605 16.9
## 5 ALASKA Seattle 305 1841 2146 14.2
## 6 AM WEST Los.Angeles 117 694 811 14.4
## 7 AM WEST Phoenix 415 4840 5255 7.9
## 8 AM WEST San.Diego 65 383 448 14.5
## 9 AM WEST San.Francisco 129 320 449 28.7
## 10 AM WEST Seattle 61 201 262 23.3
ggplot(percentage_delayed, aes(x = Destination, y = PercentDelayed, fill = Airline)) +
geom_col(position = "dodge") +
labs(title = "Percentage of Delayed Flights by Destination",
x = "Destination",
y = "Percent Delayed (%)") +
theme_minimal()
Using this analysis we can quickly identify which airline has a larger delay rate on specific routes in comparison to the others. Here we can see that the AM WEST has a larger delay rate than ALASKA, so safe to say we should avoid that airline when deciding to fly out
By creating a CSV file in wide version and tidying and transforming the data we were able to analyze the data and get a better understanding of the flights and Airline. By analyzing the total number of flights for both on time and delayed, we can see that both airline have many more on time arrivals compared to delayed, however, when taking a closer look at the overall delayed percentage of flights, we can see that AM WEST has a higher percentage rate of delayed flights compared to ALASKA airlines, so we can avoid that airline if we are worried about our flights being delayed