Overview:

In this assignment we are given a wide structured table with information on arrival delays for two airlines across five destinations. We are assigned to create a .CSV file that includes the given fight information. The goal of this assignment is to tidying and transforming the data as described below:

To begin, we will create a .CSV file with given flights data, upload load to GitHub, then load from GitHub to R. From R, we will use the necessary libraries to tidy and transform the data.

Load required libaries:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidyr)
library(dplyr)
library(ggplot2)

Obtain data form Github:

FlightsData <- read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Assignment%204/FlightsData.csv", header = TRUE)

View our data:

FlightsData
##         x      x1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

The table above shows that the data in its current format is untidy:

Tidy Data Steps:

To tidy the data, We will get rid of the row with no information and assign a name to the airline and status columns. We will also add the names of the airlines to the two empty cells in the airline column. Additionally, will transform the data from wide to long format as well as replace the . with space to prepare it for analysis:

flightsData2 <- FlightsData[-c(3),]
flightsData2
##         x      x1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

above table now shows 4 rows instead of 5 rows.

flightsData2 <- flightsData2 %>% rename(Airline = x, Status = x1)
flightsData2
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

above table now shows first to column names instead of x and x1.

flightsData2[2, 1] <- "ALASKA"
flightsData2
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Above table now shows missing Airline name in 2nd raw (ALASKA).

flightsData2[4, 1] <- "AM WEST"
flightsData2
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Above table shows missing Airline name in 4th raw (AM WEST).

The complete tidy data:

flightsData2 <- FlightsData[-c(3),]
flightsData2 <- flightsData2 %>% rename(Airline = x, Status = x1)
flightsData2[2, 1] <- "ALASKA"
flightsData2[4, 1] <- "AM WEST"
flightsData2 <- flightsData2 %>% gather("Destination", "Flights", 3:7) %>% arrange(Airline, desc(Status), Destination)
flightsData2$Destination <- str_replace_all(flightsData2$Destination, "\\.", " ")
flightsData2
##    Airline  Status   Destination Flights
## 1   ALASKA on time   Los Angeles     497
## 2   ALASKA on time       Phoenix     221
## 3   ALASKA on time     San Diego     212
## 4   ALASKA on time San Francisco     503
## 5   ALASKA on time       Seattle    1841
## 6   ALASKA delayed   Los Angeles      62
## 7   ALASKA delayed       Phoenix      12
## 8   ALASKA delayed     San Diego      20
## 9   ALASKA delayed San Francisco     102
## 10  ALASKA delayed       Seattle     305
## 11 AM WEST on time   Los Angeles     694
## 12 AM WEST on time       Phoenix    4840
## 13 AM WEST on time     San Diego     383
## 14 AM WEST on time San Francisco     320
## 15 AM WEST on time       Seattle     201
## 16 AM WEST delayed   Los Angeles     117
## 17 AM WEST delayed       Phoenix     415
## 18 AM WEST delayed     San Diego      65
## 19 AM WEST delayed San Francisco     129
## 20 AM WEST delayed       Seattle      61

above table now shows transformation of the data from wide to long format and the replaced . with space for Destination.

Analyze Data with visualization:

Number of delayed and on-time flights to each destination by airline

airline1 <- filter(flightsData2, Airline == "ALASKA")
airline2 <- filter(flightsData2, Airline == "AM WEST")
airline1
##    Airline  Status   Destination Flights
## 1   ALASKA on time   Los Angeles     497
## 2   ALASKA on time       Phoenix     221
## 3   ALASKA on time     San Diego     212
## 4   ALASKA on time San Francisco     503
## 5   ALASKA on time       Seattle    1841
## 6   ALASKA delayed   Los Angeles      62
## 7   ALASKA delayed       Phoenix      12
## 8   ALASKA delayed     San Diego      20
## 9   ALASKA delayed San Francisco     102
## 10  ALASKA delayed       Seattle     305
airline2
##    Airline  Status   Destination Flights
## 1  AM WEST on time   Los Angeles     694
## 2  AM WEST on time       Phoenix    4840
## 3  AM WEST on time     San Diego     383
## 4  AM WEST on time San Francisco     320
## 5  AM WEST on time       Seattle     201
## 6  AM WEST delayed   Los Angeles     117
## 7  AM WEST delayed       Phoenix     415
## 8  AM WEST delayed     San Diego      65
## 9  AM WEST delayed San Francisco     129
## 10 AM WEST delayed       Seattle      61

Above are two tables representing the number of delayed and on-time flights to each destination by airline

ggplot(airline1, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
ggplot(airline2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")

Above are two graphs representing the number of delayed and on-time flights to each destination by airline

From the above table and the graph, We can see that Alaska Airline has the highest number of on-time flights and delayed flights in Seattle as well the AM West Airline has the highest on-time flights and delayed flights in Phoenix, this is probably because both cities has the highest number flights for both Airlines compared to other destinations.

Airline Status Transformation:

we can use spread() function from tidyr to transform the data to bring the number of delayed and on-time flights in two new columns (delayed and on-time).

dataTran <- spread(flightsData2, Status, Flights)
dataTran
##    Airline   Destination delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San Diego      20     212
## 4   ALASKA San Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

Total delayed and on-time by destination and their corresponding ratio:

totalCity <- dataTran %>% group_by(Destination) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalCity
## # A tibble: 5 × 6
##   Destination   Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>           <int>  <int> <int>      <dbl>       <dbl>
## 1 Los Angeles       179   1191  1370     0.131        0.869
## 2 Phoenix           427   5061  5488     0.0778       0.922
## 3 San Diego          85    595   680     0.125        0.875
## 4 San Francisco     231    823  1054     0.219        0.781
## 5 Seattle           366   2042  2408     0.152        0.848
statusRatio <- select(totalCity, -c("Delayed", "OnTime", "Total"))
statusRatio <- statusRatio %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)
ggplot() + geom_bar (aes(x = Destination, y = Ratio, fill = Status), data = statusRatio, stat = "identity")

We can see from the above output and the graph that San Francisco has the largest ratio of delayed flights with about 22%, while Phoenix has the largest ratio of on time flights with about 92%.

The Destinations with the maximum and minimum number of flights respectively:

max <- totalCity %>% filter(Total == max(Total))
max
## # A tibble: 1 × 6
##   Destination Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>         <int>  <int> <int>      <dbl>       <dbl>
## 1 Phoenix         427   5061  5488     0.0778       0.922
min <- totalCity %>% filter(Total == min(Total))
min
## # A tibble: 1 × 6
##   Destination Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>         <int>  <int> <int>      <dbl>       <dbl>
## 1 San Diego        85    595   680      0.125       0.875

Total delayed and on time by airline and their respective ratio:

totalAirline <- dataTran %>% group_by(Airline) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalAirline
## # A tibble: 2 × 6
##   Airline Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>     <int>  <int> <int>      <dbl>       <dbl>
## 1 ALASKA      501   3274  3775      0.133       0.867
## 2 AM WEST     787   6438  7225      0.109       0.891
statusRatio2 <- select(totalAirline, -c("Delayed", "OnTime", "Total"))
statusRatio2 <- statusRatio2 %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)
ggplot() + geom_bar (aes(x = Airline, y = Ratio, fill = Status), data = statusRatio2, stat = "identity")

We can see above that the ratios for delayed and on-time flights for both airlines are fairly similar when destinations are combined, however AM West airlines has nearly double the number of flights than those of Alaska airlines.

Compare the number of delayed and on-time flights by destination and Airline:

ggplot(flightsData2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")

Conclusion:

Finally, when we compare the number of delayed and on-time flights by destination and Airline (from the graph above), we can see that AM WEST airline had the most on time flights. However, there are two destinations to which ALASKA airline has more on time fights than AM WEST airline, and those are San Francisco and Seattle.