library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)

Load the untidy data

First, I am going to load the csv I created which cointains the untidy dataframe I created! I uploaded the csv on Github so it would be reproducible.

flights <- read.csv("https://raw.githubusercontent.com/NikoletaEm/607LABS/main/Flight_data")
View(flights)

As you can see it looks messy, having all those empty cells and the names being messed up. I am going to tidy it. Let’s start by renaming the columns so it looks more neat.

Tidy up Proccess: Rename Columns

flights <- flights %>%
  rename(
    Airline_Company="x",
    Arrival_Status="y",
    Los_Angeles = "Los.Angeles",
    Phoenix = "Phoenix",
    San_Diego = "San.Diego",
    San_Francisco = "San.Francisco",
    Seattle = "Seattle"
  )

Now that we’re done with that let’s remove the empty row and then somehow fix the empty cells.

Tidy up Proccess: Handling missing/empty data

flights <- flights[-3, ] ## I removed the 3rd row because it was empty.
flights <- flights %>%
  mutate_all(~ ifelse(trimws(.) == "", NA, .))
flights <- flights %>%
  fill(Airline_Company, .direction = "down")
head(flights)
##   Airline_Company Arrival_Status Los_Angeles Phoenix San_Diego San_Francisco
## 1          ALASKA        on time         497     221       212           503
## 2          ALASKA        delayed          62      12        20           102
## 4         AM WEST        on time         694    4840       383           320
## 5         AM WEST        delayed         117     415        65           129
##   Seattle
## 1    1841
## 2     305
## 4     201
## 5      61
###This code takes care of managing any missing or empty values in the flight_data dataframe. It ensures that wherever there are missing values, they are replaced with NA to maintain data integrity. Additionally, in the 'Airline_Company' column, any missing values are filled downwards with the most recent non-missing value. This process ensures that each entry has a valid value.

Tada! Now we both got rid off the empty row and filled the cells containing missing data.Now let’s make our data into a longer format as it is now displayed in a wider format.

Tidy up Proccess: Longer Format

tidy_flights <- flights %>% 
  pivot_longer(cols= !c("Airline_Company", "Arrival_Status"),
    names_to="Destination",
    values_to="Count"
  )
head(tidy_flights)
## # A tibble: 6 × 4
##   Airline_Company Arrival_Status Destination   Count
##   <chr>           <chr>          <chr>         <int>
## 1 ALASKA          on time        Los_Angeles     497
## 2 ALASKA          on time        Phoenix         221
## 3 ALASKA          on time        San_Diego       212
## 4 ALASKA          on time        San_Francisco   503
## 5 ALASKA          on time        Seattle        1841
## 6 ALASKA          delayed        Los_Angeles      62

We made our data format longer using the pivot_longer command. We did it ! We tidied up our data! Now let’s move on to some Analysis.

Analysis

We need to analyze the delays of each company for each destination. In order to do that we need to filter our dataframe and keep only the delayed flights.

delays <- tidy_flights %>%
                    filter(tidy_flights$Arrival_Status == "delayed")
custom_palette <- c(  "#900C3F", "#581845")
ggplot(delays,aes(x=Destination,y=Count,fill=Airline_Company))+ggtitle("Delay Analysis") + theme(plot.title = element_text(hjust = 0.5))+scale_fill_manual(values = custom_palette)+geom_bar(stat="identity", position="dodge") 

The plot highlights Am West as the company with the most delays overall, with Phoenix recording the highest number of delays.To provide a more comprehensive comparison and validate our observation, let’s examine the percentages of delays for each destination.his will offer deeper insights into the delay patterns and support our initial observation.

Percentages

Now we’ll create a new dataframe including only vital information: Airline Company, Destinations, and Delay Percentages. These percentages are computed by dividing the number of delays for each destination by the total count of flights for that destination across all airline companies.

Delayed_Flights_Percentage <- tidy_flights %>%
  group_by(Airline_Company, Destination) |> 
  mutate(Percent_Delay = Count / sum(Count) * 100) |> 
  filter(Arrival_Status=="delayed") |> 
  select(Airline_Company, Destination, Percent_Delay)



custom_palette2 <- c( "#32CD32", "#FFD700")
ggplot(Delayed_Flights_Percentage,aes(x=Destination,y=Percent_Delay,fill=Airline_Company))+ggtitle("Percentages of Delays")+theme(plot.title = element_text(hjust = 0.5))+  scale_fill_manual(values = custom_palette2)+geom_bar(stat="identity", position="dodge")+geom_text(aes(label = paste0(round(Percent_Delay), "%")), vjust = -0.5, position =position_dodge(width = 0.9))

Conclusion

After examining the plot displaying the percentages of delays, it becomes evident that AM WEST experiences a higher frequency of delays in its scheduled flights compared to ALASKA. Furthermore, the largest number of delays is observed in the cities of San Francisco and Seattle.