For this exercise we will be reading an un-tidy data set from a csv, which has been created using PostgreSQL sql. The goal is to use the available tools to make this data more useful according to tidy data best practices.
-Each variable must have its own column. -Each observation must have its own row. -Each value must have its own cell.
library(tidyverse)
(flights <- read.csv("flights_status.csv", na.strings = c("","NA")))
## airline timing Los.Angeles Phoenix San.Diego Sanf.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA <NA> NA NA <NA>
## 4 AM WEST on time 694 4,840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
No need for the blank row partition in the dat. Lets filter that out.
(flights <- filter(flights, is.na(timing) == FALSE))
## airline timing Los.Angeles Phoenix San.Diego Sanf.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4,840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
It seems that the airline data has not been entered for each observation. We can use fill() correct.
(flights <- fill(flights,airline))
## airline timing Los.Angeles Phoenix San.Diego Sanf.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4,840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
We want to make sure we have the appropriate data types to be able to make transformations or calculations down the road. In this case some of the flight count columns came in as character values and we need them and integers.
flights$Phoenix = as.numeric(flights$Phoenix)
## Warning: NAs introduced by coercion
flights$Seattle = as.numeric(flights$Seattle)
## Warning: NAs introduced by coercion
After examining the data, we realize that we have city values listed along the top as headers. Based on tidy data principles we want all values of the same type to be uder one variable/column. Thus, we will pivot the data to accomplish this.
(flights <- pivot_longer(flights,"Los.Angeles":"Seattle", names_to = "city", values_to = "flight_count"))
## # A tibble: 20 x 4
## airline timing city flight_count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time Sanf.Francisco 503
## 5 ALASKA on time Seattle NA
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed Sanf.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix NA
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time Sanf.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed Sanf.Francisco 129
## 20 AM WEST delayed Seattle 61
We can see that ratios between the airlines are fairly similar, with AM WEST being slightly more likely to be delayed.
library(ggplot2)
ggplot(flights, aes(x = airline,weight = flight_count, fill = timing )) +
geom_bar(position = "dodge")+
labs( y= "Number of Flights")
It appears from the city breakdown below that Phoenix is what is giving the edge to AM WEST in regards to delayed proportion.
ggplot(flights, aes(x = airline,weight = flight_count, fill = timing )) +
geom_bar(position = "dodge")+
labs( y= "Number of Flights")+
facet_wrap(~city)
It may be unfair to include the cities with “NA” values with no way of knowing if the value is zero. Lets return to the airline comparison without Seattle or Phoenix. There is no run away winner, but we can say that excluding the cities without complete data improves the performance metric for both airlines.
ggplot(filter(flights, city != "Seattle", city != "Phoenix"), aes(x = airline,weight = flight_count, fill = timing )) +
geom_bar(position = "dodge")+
labs( y= "Number of Flights")