The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
Load the CSV data file from GitHub and read into R.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
flight<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Flights.csv")
flight
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Clean data. Add the header Airline and Status to column 1 and 2. To make the data process easily, change “on time” to “on time” and add "_" to all the cities.
flight[,2] <- sapply(flight[,2], str_replace, " ", "_")
flight <- rename(flight, airline=X,Status=X.1,Los_Angeles=Los.Angeles,San_Diego=San.Diego,San_Francisco=San.Francisco)
flight
## airline Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 ALASKA on_time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on_time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Arrival cities are variable, not header. To make the data easy to use, we can create a column name city and move the header under the column city. At the same time, on time and delay are two variables, which can be expended as two columns. We use tidyr to process the data.
flight <- flight %>%
gather(city, Frequency, 3:7) %>%
spread(Status, Frequency)
flight
## airline city delayed on_time
## 1 ALASKA Los_Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San_Diego 20 212
## 4 ALASKA San_Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los_Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San_Diego 65 383
## 9 AM WEST San_Francisco 129 320
## 10 AM WEST Seattle 61 201
We add delayed flights to on time flights to calculate the total number of flights by airlines and cities. Once we got those total numbers, we can use delayed flights to divide total flights. The result save in the column delayed rate. The following table sorted by the delayed rate. The table indicates AM West flight to San Francisco has the highest delayed rate, 28.73%.
flight<-flight%>%
mutate(total_flights=delayed+on_time,delayed_rate=round((delayed/total_flights)*100,2))%>%
arrange(desc(delayed_rate))
flight
## airline city delayed on_time total_flights delayed_rate
## 1 AM WEST San_Francisco 129 320 449 28.73
## 2 AM WEST Seattle 61 201 262 23.28
## 3 ALASKA San_Francisco 102 503 605 16.86
## 4 AM WEST San_Diego 65 383 448 14.51
## 5 AM WEST Los_Angeles 117 694 811 14.43
## 6 ALASKA Seattle 305 1841 2146 14.21
## 7 ALASKA Los_Angeles 62 497 559 11.09
## 8 ALASKA San_Diego 20 212 232 8.62
## 9 AM WEST Phoenix 415 4840 5255 7.90
## 10 ALASKA Phoenix 12 221 233 5.15
We will use the flight data to compare AM West and Alaska’s delayed rate. In addition, find out which city have the highest delayed rate.
library(ggplot2)
ggplot(flight, aes(y=delayed_rate,x=airline))+geom_boxplot()+geom_point()+ggtitle("Delayed Rate By Airline ")
ggplot(flight, aes(y=delayed_rate,x=city))+geom_boxplot()+geom_point()+ggtitle("Delayed Rate by City")
We use boxplot to compare the delayed Rate by Airline and cities. The first plot indicates compare to Alaska, AM West’s flight has higher delay rate. At the same time, when we look at those cities, San Francisco has the highest delayed rate. Traveler need to prepare for the flight delay if he/she need to fly to San Francisco. To the other hand, phoenix has the lowest delayed rate.