Overview:
In this assignment we are given a wide structured table with information on arrival delays for two airlines across five destinations. We are assigned to create a .CSV file that includes the given fight information. The goal of this assignment is to tidying and transforming the data as described below:
Read the information from your .CSV file into R.
Use tidyr and dplyr to tidy and transform data.
Perform analysis to compare the arrival delays for the two airlines.
Code should be in an R Markdown file, and should include narrative descriptions of the data cleanup work, analysis, and conclusions.
To begin, we will create a .CSV file with given flights data, upload load to GitHub, then load from GitHub to R. From R, we will use the necessary libraries to tidy and transform the data.
Load required libaries:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyr)
library(dplyr)
library(ggplot2)
Obtain data form Github:
FlightsData <- read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Assignment%204/FlightsData.csv", header = TRUE)
View our data:
FlightsData
## x x1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The table above shows that the data in its current format is untidy:
The 1st two columns can have better name than x and x1.
The 3rd row contains no data, both NA values and empty strings.
The airline name is missing in two of the cells, rows 2 and 5 in the x column.
The data is in a wide format as we can see the names of the cities are condensed into a single column. Also, the values in the cells can also be shortened into a single column.
We can also see that the some if city names are produced in the column with a . characters (Los.Angeles, San.Diego, etc.). The . was not present in the city names in the original .CSV file, but they were added during the import to prevent there being spaces in the dataframe column names.
Tidy Data Steps:
To tidy the data, We will get rid of the row with no information and assign a name to the airline and status columns. We will also add the names of the airlines to the two empty cells in the airline column. Additionally, will transform the data from wide to long format as well as replace the . with space to prepare it for analysis:
flightsData2 <- FlightsData[-c(3),]
flightsData2
## x x1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
above table now shows 4 rows instead of 5 rows.
flightsData2 <- flightsData2 %>% rename(Airline = x, Status = x1)
flightsData2
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
above table now shows first to column names instead of x and x1.
flightsData2[2, 1] <- "ALASKA"
flightsData2
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Above table now shows missing Airline name in 2nd raw (ALASKA).
flightsData2[4, 1] <- "AM WEST"
flightsData2
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Above table shows missing Airline name in 4th raw (AM WEST).
The complete tidy data:
flightsData2 <- FlightsData[-c(3),]
flightsData2 <- flightsData2 %>% rename(Airline = x, Status = x1)
flightsData2[2, 1] <- "ALASKA"
flightsData2[4, 1] <- "AM WEST"
flightsData2 <- flightsData2 %>% gather("Destination", "Flights", 3:7) %>% arrange(Airline, desc(Status), Destination)
flightsData2$Destination <- str_replace_all(flightsData2$Destination, "\\.", " ")
flightsData2
## Airline Status Destination Flights
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
above table now shows transformation of the data from wide to long format and the replaced . with space for Destination.
Analyze Data with visualization:
Number of delayed and on-time flights to each destination by airline
airline1 <- filter(flightsData2, Airline == "ALASKA")
airline2 <- filter(flightsData2, Airline == "AM WEST")
airline1
## Airline Status Destination Flights
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
airline2
## Airline Status Destination Flights
## 1 AM WEST on time Los Angeles 694
## 2 AM WEST on time Phoenix 4840
## 3 AM WEST on time San Diego 383
## 4 AM WEST on time San Francisco 320
## 5 AM WEST on time Seattle 201
## 6 AM WEST delayed Los Angeles 117
## 7 AM WEST delayed Phoenix 415
## 8 AM WEST delayed San Diego 65
## 9 AM WEST delayed San Francisco 129
## 10 AM WEST delayed Seattle 61
Above are two tables representing the number of delayed and on-time flights to each destination by airline
ggplot(airline1, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
ggplot(airline2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
Above are two graphs representing the number of delayed and on-time
flights to each destination by airline
From the above table and the graph, We can see that Alaska Airline has the highest number of on-time flights and delayed flights in Seattle as well the AM West Airline has the highest on-time flights and delayed flights in Phoenix, this is probably because both cities has the highest number flights for both Airlines compared to other destinations.
Airline Status Transformation:
we can use spread() function from tidyr to transform the data to bring the number of delayed and on-time flights in two new columns (delayed and on-time).
dataTran <- spread(flightsData2, Status, Flights)
dataTran
## Airline Destination delayed on time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Total delayed and on-time by destination and their corresponding ratio:
totalCity <- dataTran %>% group_by(Destination) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalCity
## # A tibble: 5 × 6
## Destination Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Los Angeles 179 1191 1370 0.131 0.869
## 2 Phoenix 427 5061 5488 0.0778 0.922
## 3 San Diego 85 595 680 0.125 0.875
## 4 San Francisco 231 823 1054 0.219 0.781
## 5 Seattle 366 2042 2408 0.152 0.848
statusRatio <- select(totalCity, -c("Delayed", "OnTime", "Total"))
statusRatio <- statusRatio %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)
ggplot() + geom_bar (aes(x = Destination, y = Ratio, fill = Status), data = statusRatio, stat = "identity")
We can see from the above output and the graph that San Francisco has the largest ratio of delayed flights with about 22%, while Phoenix has the largest ratio of on time flights with about 92%.
The Destinations with the maximum and minimum number of flights respectively:
max <- totalCity %>% filter(Total == max(Total))
max
## # A tibble: 1 × 6
## Destination Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Phoenix 427 5061 5488 0.0778 0.922
min <- totalCity %>% filter(Total == min(Total))
min
## # A tibble: 1 × 6
## Destination Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 San Diego 85 595 680 0.125 0.875
Total delayed and on time by airline and their respective ratio:
totalAirline <- dataTran %>% group_by(Airline) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalAirline
## # A tibble: 2 × 6
## Airline Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA 501 3274 3775 0.133 0.867
## 2 AM WEST 787 6438 7225 0.109 0.891
statusRatio2 <- select(totalAirline, -c("Delayed", "OnTime", "Total"))
statusRatio2 <- statusRatio2 %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)
ggplot() + geom_bar (aes(x = Airline, y = Ratio, fill = Status), data = statusRatio2, stat = "identity")
We can see above that the ratios for delayed and on-time flights for both airlines are fairly similar when destinations are combined, however AM West airlines has nearly double the number of flights than those of Alaska airlines.
Compare the number of delayed and on-time flights by destination and Airline:
ggplot(flightsData2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
Conclusion:
Finally, when we compare the number of delayed and on-time flights by destination and Airline (from the graph above), we can see that AM WEST airline had the most on time flights. However, there are two destinations to which ALASKA airline has more on time fights than AM WEST airline, and those are San Francisco and Seattle.