The goal of this assignment is to tidy a data set containing the frequency of on time and delayed flights to a number of cities for two airlines.
Airline Data
A CSV file was created to resemble the data above:
Airline CSV
To import the file we use the read.csv function. The file is stored in the working directory. We will also call the tidyr and dplyr functions.
library(stringr)
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
rawData <- read.csv("airlines.csv")
rawData
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Removing the third row…
rawData = rawData[-c(3), ]
rawData
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Tidy data should have one record per observation and all variables should be in the header row. The variables here would be airline, flight_status, destination_city, on_time, and delayed. To get the data into this format, we use the functions of packages tidyr.
rawData$X[2] <- "ALASKA"
rawData$X[4] <- "AM WEST"
rawData <- rename(rawData, airlines=X, flight_status=X.1)
rawData$flight_status <- str_replace_all(rawData$flight_status, " ", "_")
tidyData <- gather(rawData, "dest_city", "number_of_flights",3:7)
tidyData <- spread(tidyData, flight_status,4)
tidyData
## airlines dest_city delayed on_time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
To perform an analysis, we will have to transform the data into a data frame that fits our desired analysis. In this case, we will attempt to see the ratio of on time vs delayed flights per airline for each city.
df <- tidyData
df <- group_by(df, airlines, dest_city)
df <- summarize(df, total_on_time = sum(on_time), total_delayed = sum(delayed))
df$total_flights = df$total_delayed + df$total_on_time
df <- mutate(df, pct_on_time = round(total_on_time/total_flights*100,2), pct_delayed = round(total_delayed/total_flights*100,2))
analysis_df <- data.frame(df)
analysis_df
## airlines dest_city total_on_time total_delayed total_flights pct_on_time pct_delayed
## 1 ALASKA Los.Angeles 497 62 559 88.91 11.09
## 2 ALASKA Phoenix 221 12 233 94.85 5.15
## 3 ALASKA San.Diego 212 20 232 91.38 8.62
## 4 ALASKA San.Francisco 503 102 605 83.14 16.86
## 5 ALASKA Seattle 1841 305 2146 85.79 14.21
## 6 AM WEST Los.Angeles 694 117 811 85.57 14.43
## 7 AM WEST Phoenix 4840 415 5255 92.10 7.90
## 8 AM WEST San.Diego 383 65 448 85.49 14.51
## 9 AM WEST San.Francisco 320 129 449 71.27 28.73
## 10 AM WEST Seattle 201 61 262 76.72 23.28
We will attempt to visualize the percentage of on time flights per city in a bar chart.
ggplot(analysis_df, aes(x=dest_city,y = pct_delayed,fill=airlines)) +
geom_bar(width = .75,stat = "identity", position="dodge") +
ggtitle("Airline Delayed Rates per Destination City") +
labs(x="Destination City",y="Delayed Percentage", fill= "Airlines") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(breaks = seq(0,100,by = 1))
AM West has a higher rate of delayed arrivals. San Francisco has the highest rate of delayed arrivals across both airlines, while Phoenix has the least across both airlines.