Introduction

For this exercise we will be reading an un-tidy data set from a csv, which has been created using PostgreSQL sql. The goal is to use the available tools to make this data more useful according to tidy data best practices.

Tidy Rules:

-Each variable must have its own column. -Each observation must have its own row. -Each value must have its own cell.

library(tidyverse)
(flights <- read.csv("flights_status.csv", na.strings = c("","NA")))
##   airline  timing Los.Angeles Phoenix San.Diego Sanf.Francisco Seattle
## 1  ALASKA on time         497     221       212            503   1,841
## 2    <NA> delayed          62      12        20            102     305
## 3    <NA>    <NA>          NA    <NA>        NA             NA    <NA>
## 4 AM WEST on time         694   4,840       383            320     201
## 5    <NA> delayed         117     415        65            129      61

Filter

No need for the blank row partition in the dat. Lets filter that out.

(flights <- filter(flights, is.na(timing) == FALSE))
##   airline  timing Los.Angeles Phoenix San.Diego Sanf.Francisco Seattle
## 1  ALASKA on time         497     221       212            503   1,841
## 2    <NA> delayed          62      12        20            102     305
## 3 AM WEST on time         694   4,840       383            320     201
## 4    <NA> delayed         117     415        65            129      61

Fill

It seems that the airline data has not been entered for each observation. We can use fill() correct.

(flights <- fill(flights,airline))
##   airline  timing Los.Angeles Phoenix San.Diego Sanf.Francisco Seattle
## 1  ALASKA on time         497     221       212            503   1,841
## 2  ALASKA delayed          62      12        20            102     305
## 3 AM WEST on time         694   4,840       383            320     201
## 4 AM WEST delayed         117     415        65            129      61

Data Type

We want to make sure we have the appropriate data types to be able to make transformations or calculations down the road. In this case some of the flight count columns came in as character values and we need them and integers.

flights$Phoenix = as.numeric(flights$Phoenix)
## Warning: NAs introduced by coercion
flights$Seattle = as.numeric(flights$Seattle)
## Warning: NAs introduced by coercion

Pivot

After examining the data, we realize that we have city values listed along the top as headers. Based on tidy data principles we want all values of the same type to be uder one variable/column. Thus, we will pivot the data to accomplish this.

(flights <- pivot_longer(flights,"Los.Angeles":"Seattle", names_to = "city", values_to = "flight_count"))
## # A tibble: 20 x 4
##    airline timing  city           flight_count
##    <chr>   <chr>   <chr>                 <dbl>
##  1 ALASKA  on time Los.Angeles             497
##  2 ALASKA  on time Phoenix                 221
##  3 ALASKA  on time San.Diego               212
##  4 ALASKA  on time Sanf.Francisco          503
##  5 ALASKA  on time Seattle                  NA
##  6 ALASKA  delayed Los.Angeles              62
##  7 ALASKA  delayed Phoenix                  12
##  8 ALASKA  delayed San.Diego                20
##  9 ALASKA  delayed Sanf.Francisco          102
## 10 ALASKA  delayed Seattle                 305
## 11 AM WEST on time Los.Angeles             694
## 12 AM WEST on time Phoenix                  NA
## 13 AM WEST on time San.Diego               383
## 14 AM WEST on time Sanf.Francisco          320
## 15 AM WEST on time Seattle                 201
## 16 AM WEST delayed Los.Angeles             117
## 17 AM WEST delayed Phoenix                 415
## 18 AM WEST delayed San.Diego                65
## 19 AM WEST delayed Sanf.Francisco          129
## 20 AM WEST delayed Seattle                  61

Comparison by Airline

We can see that ratios between the airlines are fairly similar, with AM WEST being slightly more likely to be delayed.

library(ggplot2)
ggplot(flights, aes(x = airline,weight = flight_count, fill = timing )) + 
  geom_bar(position = "dodge")+
  labs( y= "Number of Flights")

Comparison by City

It appears from the city breakdown below that Phoenix is what is giving the edge to AM WEST in regards to delayed proportion.

ggplot(flights, aes(x = airline,weight = flight_count, fill = timing )) + 
  geom_bar(position = "dodge")+
  labs( y= "Number of Flights")+
  facet_wrap(~city)

Conclusion

It may be unfair to include the cities with “NA” values with no way of knowing if the value is zero. Lets return to the airline comparison without Seattle or Phoenix. There is no run away winner, but we can say that excluding the cities without complete data improves the performance metric for both airlines.

ggplot(filter(flights, city != "Seattle", city != "Phoenix"), aes(x = airline,weight = flight_count, fill = timing )) + 
  geom_bar(position = "dodge")+
  labs( y= "Number of Flights")