The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository.
The URL for your rpubs.com web page.
First, I loaded the necessary packages for the analysis and imported
an ‘airlines.csv’ file, which I had previously saved in the working
directory and also uploaded to GitHub for easy access. After loading the
dataset, I reviewed its structure. To organize the data by airline and
flight status, I first replaced empty strings with NA values. Using the
tidyr package, I filled missing airline names downward with
the fill function. Then, I renamed columns for clarity
using dplyr’s rename() function, and finally,
I removed rows with missing statuses by applying the
filter() function.
# Load required libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Read data from GitHub repository
airlines_dat <- read.csv(url('https://raw.githubusercontent.com/Shriyanshh/Data607-Week_Four-Assignment/refs/heads/main/airlines.csv'))
# Display the data
airlines_dat
# Replace empty strings in 'X' column with NA
airlines_dat$X[airlines_dat$X==""] <- NA
# Clean and organize the data
clean_data <- airlines_dat %>%
fill(X, .direction='down') %>% # Fill missing airline names downwards
rename(Airline = X, Status = X.1) %>% # Rename columns for clarity
filter(Status != "") # Remove rows where Status is empty
# Display the cleaned data
clean_data
To streamline the dataset, I transformed the multiple city columns
into a single column. Using the tidyr package’s
gather function, I converted the dataset from wide to long
format, creating a new ‘city’ column to consolidate city names and a
‘count’ column to aggregate the counts. I excluded the ‘Airlines’ and
‘Status’ columns from this transformation to maintain their original
format. This method efficiently condensed the city data into a more
manageable form.
# Combine city columns into a single column while preserving airline and status data
clean_data = clean_data %>%
gather(key = "City", value = "Count", -Airline, -Status) # Converts multiple city columns into two columns 'City' and 'Count'
clean_data
The final step in refining the data involved removing any empty
columns. This was achieved by applying the filter function
to exclude columns that contained no data.
# Remove rows where 'Count' is NA to ensure data integrity
clean_data = clean_data %>%
filter(!is.na(Count))
clean_data
For the assignment analysis on arrival delays between two airlines, I
first filtered the cleaned dataset to include only entries where the
Status was ‘delayed’. Using dplyr, I then grouped the data
first by ‘Airline’ and then by ‘City’ to prepare for detailed
comparative analysis. This structured approach allowed us to focus
specifically on delay patterns across different cities for each
airline.
# Filter for delayed flights and group data by Airline and City for focused analysis
summary_stats = clean_data %>%
filter(Status == "delayed") %>%
group_by(Airline, City)
summary_stats
After analyzing the dataset, it was found that Alaska Airlines experienced fewer delays compared to AM West at four out of five airports. However, in Seattle, Alaska had more delays than AM West, suggesting AM West as a better choice for avoiding delays in that city. The most significant difference was observed at Phoenix Airport, where AM West had 415 delays compared to Alaska’s 12. In contrast, the difference was least noticeable at San Francisco Airport with AM West and Alaska having 129 and 102 delays, respectively.
To calculate the proportion of delayed flights per city and airline, I first aggregated the total flight counts for each airline and city. Then, I retrieved the number of delayed flights for each grouping. By performing an inner join on these datasets, I obtained both the number of delays and total flights. Subsequently, I calculated the delay proportion and converted these figures to percentages. The resulting table displays the percentage of delayed flights, helping to identify which airline performs better in each city.
# Summarize total flights by airline and city
total_flights = clean_data %>%
group_by(Airline, City) %>%
summarize(Total_Count = sum(Count))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
# Filter for delayed flights and summarize by airline and city
delayed_flights = clean_data %>%
filter(Status == "delayed") %>%
group_by(Airline, City) %>%
summarize(Delayed_Count = sum(Count))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
# Join datasets to get total and delayed counts together
delayed_and_total = inner_join(total_flights, delayed_flights, by = c("Airline", "City"))
# Calculate the ratio of delayed flights
prop_delays = delayed_and_total %>%
mutate(Delayed_Ratio = Delayed_Count / Total_Count)
# Arrange by city and select columns of interest
prop_delays = prop_delays %>%
select(Airline, City, Delayed_Ratio) %>%
arrange(City)
# Convert ratio to percent and round off
prop_delays = prop_delays %>%
mutate(Delayed_Percent = round(Delayed_Ratio * 100, 1))
# Load knitr library for table creation
library(knitr)
# Create a table displaying the percentage of delayed flights by city and airline
prop_delays %>%
select(Airline, City, Delayed_Percent) %>%
kable(caption = "Percent of Delayed Flights by City and Airline")
| Airline | City | Delayed_Percent |
|---|---|---|
| ALASKA | Los.Angeles | 11.1 |
| AM WEST | Los.Angeles | 14.4 |
| ALASKA | Phoenix | 5.2 |
| AM WEST | Phoenix | 7.9 |
| ALASKA | San.Diego | 8.6 |
| AM WEST | San.Diego | 14.5 |
| ALASKA | San.Francisco | 16.9 |
| AM WEST | San.Francisco | 28.7 |
| ALASKA | Seattle | 14.2 |
| AM WEST | Seattle | 23.3 |
When comparing the ratio of delayed flights to total flights for each airline in various cities, the percentages are relatively similar between ALASKA and AM WEST. For instance, in San Diego, 8% of ALASKA’s flights are delayed compared to 14.5% for AM WEST. Similarly, in San Francisco, the delay rates are 16.9% for ALASKA and 28.7% for AM WEST. To visually represent this data, I used ggplot to create a graph displaying the total number of flights by airline for each city.
# Load the ggplot2 library for creating visualizations
library(ggplot2)
# Create a bar chart with total flight counts by city and airline
ggplot(total_flights, aes(x=City, y=Total_Count, fill=Airline)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Total Number of Flights per City per Airline",
x="City",
y="Total Number of Flights",
fill="Airline") +
theme_minimal()
The visualization clearly shows that Phoenix airport handles a significantly higher number of flights from AM WEST compared to ALASKA, which may contribute to their higher delay rates due to the increased volume of flights. Conversely, ALASKA operates more flights to Seattle than AM WEST, yet ALASKA still manages fewer delays there. This suggests that the volume of flights could be a factor in the frequency of delays, though other operational efficiencies likely also play a role.
In this assignment, I utilized the Tidyr and dplyr packages for their robust data cleaning and organizing capabilities, which streamlined handling complex or poorly formatted datasets. Starting with a wide-format CSV containing airline data, I transformed it into a format digestible by R. This cleaned dataset then allowed me to analyze and compare airline delays. By calculating the delay proportions and percentages per city and airline, and visualizing these in both tabular and graphical formats, I could effectively present the density of flights to various cities by each airline.