Create a .CSV file that includes information in a “wide” structure. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Perform analysis to compare the arrival delays for the two airlines
library(tidyr)
library(dplyr)
library(ggplot2)
rawdata = read.csv("https://raw.githubusercontent.com/L-Velasco/Fall16_IS607/master/IS607%20Assignment%20-%20Airlines.csv"); rawdata
## Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on_time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AMWEST on_time 694 4840 383 320 201
## 4 AMWEST delayed 117 415 65 129 61
Used gather() to make the city destination variables be part of observations and spread() to make the arrival type (on time, delayed) be variables. This step reshapes the data from “wide” to “tall” presentation.
tidydata <- rawdata %>%
gather("destination","freq",3:7) %>%
spread(Arrival,freq); tidydata
## Airline destination delayed on_time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AMWEST Los.Angeles 117 694
## 7 AMWEST Phoenix 415 4840
## 8 AMWEST San.Diego 65 383
## 9 AMWEST San.Francisco 129 320
## 10 AMWEST Seattle 61 201
Used mutate() to add two new variables that sums the total flights, and calculate the percent of delayed flights, per Airline and destination. These new variables explore the airline’s arrival performance across the city destinations.
transformed_data <- mutate(tidydata,
total_flights = delayed + on_time,
pct_delayed = round(delayed / total_flights * 100)); transformed_data
## Airline destination delayed on_time total_flights pct_delayed
## 1 ALASKA Los.Angeles 62 497 559 11
## 2 ALASKA Phoenix 12 221 233 5
## 3 ALASKA San.Diego 20 212 232 9
## 4 ALASKA San.Francisco 102 503 605 17
## 5 ALASKA Seattle 305 1841 2146 14
## 6 AMWEST Los.Angeles 117 694 811 14
## 7 AMWEST Phoenix 415 4840 5255 8
## 8 AMWEST San.Diego 65 383 448 15
## 9 AMWEST San.Francisco 129 320 449 29
## 10 AMWEST Seattle 61 201 262 23
Visually show the percentage of delayed flights.
Given the dataset, the Alaska airline seems to perform better arriving on time compared to Amwest in each destination.
plot_data <- select(transformed_data, Airline, destination, pct_delayed)
ggplot(data = plot_data,aes(x=destination, y=pct_delayed)) +
geom_bar(stat="identity") +
facet_wrap(~ Airline) +
coord_flip() +
labs(title="Comparison of Arrival Delays", x="Destination", y="Percentage of Arrival Delays")
However, when we look at overall performance across all city destinations, Amwest seems to perform slightly better since this airline has the greater number of flights.
Airline_delayed_data <- transformed_data %>%
group_by(Airline) %>%
summarise(total_flights_overall = sum(total_flights), total_delayed = sum(delayed)) %>%
mutate(pct_delayed_overall = round((total_delayed / total_flights_overall) * 100))
Airline_delayed_data
## # A tibble: 2 × 4
## Airline total_flights_overall total_delayed pct_delayed_overall
## <fctr> <int> <int> <dbl>
## 1 ALASKA 3775 501 13
## 2 AMWEST 7225 787 11
ggplot(data = Airline_delayed_data,aes(x=Airline, y=pct_delayed_overall)) +
geom_bar(stat="identity") +
labs(title="Comparison of Arrival Delays", x="Airline", y="Percentage of Arrival Delays")