The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
I will begin by creating a data frame which will include the information from the CSV file. We will use the link from the raw github code of the CSV file that we uploaded to our repository which includes all the information in the table. We will then load the libraries necessary to tidy this data which are tidyr and dplyr.
flightscsv1 <- read.csv("https://raw.githubusercontent.com/Zcash95/DATA607-5/main/DATA607airlineschedule%20-%20Sheet1.csv")
head(flightscsv1)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On-time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West On-time 694 4840 383 320 201
## 5 Delayed 117 415 65 129 61
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We’ll being by adding names to the columns that have none such as the first and second columns which indicate the airline and the status of the flight. We will add names to the airlines column to indicate which airline the delayed flights are from. We will then proceed by deleting any empty rows and as we can see the entire cells of row 3 are empty so we will just delete that row
flightscsv1 <- flightscsv1 %>%
rename("Airline" = 1, "Status" = 2, "Los Angeles" = 3, "Phoenix" = 4, "San Diego" = 5, "San Francisco" = 6, "Seattle" = 7)
print(flightscsv1)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska On-time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West On-time 694 4840 383 320 201
## 5 Delayed 117 415 65 129 61
flightscsv1[2, "Airline"] <- "Alaska"
flightscsv1[5, "Airline"] <- "AM West"
flightscsv1 <- flightscsv1 %>%
na.omit(flightscsv)
head(flightscsv1)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska On-time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 4 AM West On-time 694 4840 383 320 201
## 5 AM West Delayed 117 415 65 129 61
We will then attempt to tidy the data further by putting in a long format which is in rows. This will help us complete the analysis that we need to by putting the data in a longer format of rows instead of spread out by columns
flightscsv_long <- pivot_longer(flightscsv1, cols = -c(Airline, Status), names_to = "Destination", values_to = "Flights")
print(flightscsv_long)
## # A tibble: 20 × 4
## Airline Status Destination Flights
## <chr> <chr> <chr> <int>
## 1 Alaska On-time Los Angeles 497
## 2 Alaska On-time Phoenix 221
## 3 Alaska On-time San Diego 212
## 4 Alaska On-time San Francisco 503
## 5 Alaska On-time Seattle 1841
## 6 Alaska Delayed Los Angeles 62
## 7 Alaska Delayed Phoenix 12
## 8 Alaska Delayed San Diego 20
## 9 Alaska Delayed San Francisco 102
## 10 Alaska Delayed Seattle 305
## 11 AM West On-time Los Angeles 694
## 12 AM West On-time Phoenix 4840
## 13 AM West On-time San Diego 383
## 14 AM West On-time San Francisco 320
## 15 AM West On-time Seattle 201
## 16 AM West Delayed Los Angeles 117
## 17 AM West Delayed Phoenix 415
## 18 AM West Delayed San Diego 65
## 19 AM West Delayed San Francisco 129
## 20 AM West Delayed Seattle 61
We analyzed the data by creating new colums names On-time and delayed which counted the flights that were either on time or delayed. We then proceeded to add the total of delayed and on time flights and divided it by the number of delayed and this gave us a proportion indicating that Alaska airlines had more flights on time then AM West.
performance_data <- flightscsv_long %>%
group_by(Airline, Status) %>%
summarise(Count = n())
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
performance_data <- performance_data %>%
mutate(
OnTimeCount = ifelse(Status == "On-time", Count, 0),
DelayedCount = ifelse(Status == "Delayed", Count, 0)) %>%
select(Airline, OnTimeCount, DelayedCount)
performance_data <- performance_data %>%
group_by(Airline) %>%
summarise( OnTime = sum(OnTimeCount),Delayed = sum(DelayedCount))
performance_data <- performance_data %>%
mutate(OnTimePercentage = (OnTime / (OnTime + Delayed)) * 100,
DelayedPercentage = (Delayed / (OnTime + Delayed)) * 100)
print(performance_data)
## # A tibble: 2 × 5
## Airline OnTime Delayed OnTimePercentage DelayedPercentage
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AM West 5 5 50 50
## 2 Alaska 5 5 50 50
The conclusion that we got after analyzing our data is that Alaska airlines had more flights on time then AM West did.