library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
Load the untidy data
First, I am going to load the csv I created which cointains the
untidy dataframe I created! I uploaded the csv on Github so it would be
reproducible.
flights <- read.csv("https://raw.githubusercontent.com/NikoletaEm/607LABS/main/Flight_data")
View(flights)
As you can see it looks messy, having all those empty cells and the
names being messed up. I am going to tidy it. Let’s start by renaming
the columns so it looks more neat.
Tidy up Proccess: Rename Columns
flights <- flights %>%
rename(
Airline_Company="x",
Arrival_Status="y",
Los_Angeles = "Los.Angeles",
Phoenix = "Phoenix",
San_Diego = "San.Diego",
San_Francisco = "San.Francisco",
Seattle = "Seattle"
)
Now that we’re done with that let’s remove the empty row and then
somehow fix the empty cells.
Tidy up Proccess: Handling missing/empty data
flights <- flights[-3, ] ## I removed the 3rd row because it was empty.
flights <- flights %>%
mutate_all(~ ifelse(trimws(.) == "", NA, .))
flights <- flights %>%
fill(Airline_Company, .direction = "down")
head(flights)
## Airline_Company Arrival_Status Los_Angeles Phoenix San_Diego San_Francisco
## 1 ALASKA on time 497 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 4 AM WEST on time 694 4840 383 320
## 5 AM WEST delayed 117 415 65 129
## Seattle
## 1 1841
## 2 305
## 4 201
## 5 61
###This code takes care of managing any missing or empty values in the flight_data dataframe. It ensures that wherever there are missing values, they are replaced with NA to maintain data integrity. Additionally, in the 'Airline_Company' column, any missing values are filled downwards with the most recent non-missing value. This process ensures that each entry has a valid value.
Tada! Now we both got rid off the empty row and filled the
cells containing missing data.Now let’s make our data into a longer
format as it is now displayed in a wider format.
Analysis
We need to analyze the delays of each company for each destination.
In order to do that we need to filter our dataframe and keep only the
delayed flights.
delays <- tidy_flights %>%
filter(tidy_flights$Arrival_Status == "delayed")
custom_palette <- c( "#900C3F", "#581845")
ggplot(delays,aes(x=Destination,y=Count,fill=Airline_Company))+ggtitle("Delay Analysis") + theme(plot.title = element_text(hjust = 0.5))+scale_fill_manual(values = custom_palette)+geom_bar(stat="identity", position="dodge")

The plot highlights Am West as the company with the most delays
overall, with Phoenix recording the highest number of delays.To provide
a more comprehensive comparison and validate our observation, let’s
examine the percentages of delays for each destination.his will offer
deeper insights into the delay patterns and support our initial
observation.
Conclusion
After examining the plot displaying the percentages of delays, it
becomes evident that AM WEST experiences a higher frequency of delays in
its scheduled flights compared to ALASKA. Furthermore, the largest
number of delays is observed in the cities of San Francisco and
Seattle.