Tidying Airline Data

The goal of this assignment is to tidy a data set containing the frequency of on time and delayed flights to a number of cities for two airlines.

Airline Data

1. Importing CSV data

A CSV file was created to resemble the data above:

Airline CSV

To import the file we use the read.csv function. The file is stored in the working directory. We will also call the tidyr and dplyr functions.

library(stringr)
library(ggplot2)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

rawData <- read.csv("airlines.csv")

rawData

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Removing the third row…

rawData = rawData[-c(3), ]

rawData

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

2. Tidying the data (tidyr)

Tidy data should have one record per observation and all variables should be in the header row. The variables here would be airline, flight_status, destination_city, on_time, and delayed. To get the data into this format, we use the functions of packages tidyr.

rawData$X[2] <- "ALASKA"
rawData$X[4] <- "AM WEST"

rawData <- rename(rawData, airlines=X, flight_status=X.1)

rawData$flight_status <- str_replace_all(rawData$flight_status, " ", "_")

tidyData <- gather(rawData, "dest_city", "number_of_flights",3:7)

tidyData <- spread(tidyData, flight_status,4)

tidyData

##    airlines     dest_city delayed on_time
## 1    ALASKA   Los.Angeles      62     497
## 2    ALASKA       Phoenix      12     221
## 3    ALASKA     San.Diego      20     212
## 4    ALASKA San.Francisco     102     503
## 5    ALASKA       Seattle     305    1841
## 6   AM WEST   Los.Angeles     117     694
## 7   AM WEST       Phoenix     415    4840
## 8   AM WEST     San.Diego      65     383
## 9   AM WEST San.Francisco     129     320
## 10  AM WEST       Seattle      61     201

3. Transforming the data (dplyr)

To perform an analysis, we will have to transform the data into a data frame that fits our desired analysis. In this case, we will attempt to see the ratio of on time vs delayed flights per airline for each city.

df <- tidyData
df <- group_by(df, airlines, dest_city)
df <- summarize(df, total_on_time = sum(on_time), total_delayed = sum(delayed))
df$total_flights = df$total_delayed + df$total_on_time
df <- mutate(df, pct_on_time = round(total_on_time/total_flights*100,2), pct_delayed = round(total_delayed/total_flights*100,2))
analysis_df <- data.frame(df)
analysis_df

##    airlines     dest_city total_on_time total_delayed total_flights pct_on_time pct_delayed
## 1    ALASKA   Los.Angeles           497            62           559       88.91       11.09
## 2    ALASKA       Phoenix           221            12           233       94.85        5.15
## 3    ALASKA     San.Diego           212            20           232       91.38        8.62
## 4    ALASKA San.Francisco           503           102           605       83.14       16.86
## 5    ALASKA       Seattle          1841           305          2146       85.79       14.21
## 6   AM WEST   Los.Angeles           694           117           811       85.57       14.43
## 7   AM WEST       Phoenix          4840           415          5255       92.10        7.90
## 8   AM WEST     San.Diego           383            65           448       85.49       14.51
## 9   AM WEST San.Francisco           320           129           449       71.27       28.73
## 10  AM WEST       Seattle           201            61           262       76.72       23.28

4. Data visualization (ggplot2)

We will attempt to visualize the percentage of on time flights per city in a bar chart.

ggplot(analysis_df, aes(x=dest_city,y = pct_delayed,fill=airlines)) +
   geom_bar(width = .75,stat = "identity", position="dodge") +
   ggtitle("Airline Delayed Rates per Destination City") +
   labs(x="Destination City",y="Delayed Percentage", fill= "Airlines") +
   theme(plot.title = element_text(hjust=0.5)) +
   scale_y_continuous(breaks = seq(0,100,by = 1))

5. Conclusions

AM West has a higher rate of delayed arrivals. San Francisco has the highest rate of delayed arrivals across both airlines, while Phoenix has the least across both airlines.