row

Importing CSV & Basic Cleaning


I loaded the required libraries, dplyr and tidyr as well as flexdashboard and htmltools to, in the setup chunk which is set not to echo in this dashboard. These last two packages allowed me to format the html document, and its comments in this stratified rMarkdown format.

Here I load the date directly from my Github repository, which is public and setting the na.strings=c("") argument so that the blank cells beneath each airline, which we are to assume is the previous airline, is filled with NA. This will make filling them with a value easier later on in the process.

After loading the table, names were given the the first two columns, which had not headers in the csv by expressly assigning names to only the first two columns by selecting them with data[1:2].

Then I used apply() and gsub() to replace the “,” in the numbers (which are not character fields) with nothing, effectively removing them from the text. You can see the resulting data in the table below. We have uniformly formatted text and column names.

url <- "https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/airlinesRaw.csv"
raw_data <- read.csv(url, header = TRUE, na.strings = c(""))  #NA added to facilitate fill later
names(raw_data)[1:2] <- c("Airline", "Arrival")  #naming first two columns 
raw_data <- data.frame(apply(raw_data, 2, function(y) (gsub(",", "", y))))  #removing ',' 
knitr::kable(raw_data)
Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
AMWEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

row

Using Tidyr & Dplyr to Manipulate Data Frame


Starting with the table above, all the fields are still character data, with those two NA values in the Airline and Arrival.

I create a new frame, called new_data and using pipes, dplyr, and tidyr do the following:

  • Fill the two empty airline cells with the PREVIOUS (the default for fill()) value for Airline
  • Recode the Arrival to Delayed and On_Time which will both meet the convention of capitalizing a variable, which I am following and the R requirement of no spaces (planning ahead)
  • Create columns City from the current variable names and Flghts which will be filled with the values beneath the current column’s names by gathering in the wide data
  • Mutate is applied to the single column Flights to create numeric values for this data
  • spread is used to split out On_Time column and Delayed columns as they are separate variables
  • City is now converted to a factor
  • City levels are recoded using tidyr’s recode to remove the dots in the names, (not necessary but leads to uniformity accross data tables)

You can see from the following table, that the data is long, uniform and well groomed.

new_data <- raw_data %>% fill(Airline) %>% mutate(Arrival = recode(Arrival, 
    delayed = "Delayed", `on time` = "On_Time")) %>% gather(City, Flights, Los.Angeles:Seattle) %>% 
    mutate(Flights = as.numeric(Flights)) %>% spread(Arrival, Flights) %>% mutate(City = as.factor(City)) %>% 
    mutate(City = recode(City, Los.Angeles = "Los Angeles", San.Diego = "San Diego", 
        San.Francisco = "San Francisco"))
knitr::kable(new_data)
Airline City Delayed On_Time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AMWEST Los Angeles 117 694
AMWEST Phoenix 415 4840
AMWEST San Diego 65 383
AMWEST San Francisco 129 320
AMWEST Seattle 61 201

row

Comparison Code & Tables: Airline (all cities)


With major grooming done, it is a simple matter of creating percents of flights for each airline which were delayed. The formula for each airline is:\[ delays/all-flights\] The following steps were followed to calculate the percents, using dplyr and standard mathematical functions

  • Create a data frame called Airline_Comparison and set it equal to new_data
  • Which we group by Airline
  • Create a Percent_Delay column using summarize() and the above formula, applying sum() to each column so as to take the total for all group_by() categories of Airline for the numerator (delays) and the denominator (delays + on-time flights), rounding to 3 places and multiplying by 100 to get the percent of each airline delayed

in printing the table, I pasted the % symbol in to make the results unmistakable, but in the data set, they are still numeric data

Airline_Comparisons <- new_data %>% group_by(Airline) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) + 
    sum(On_Time))), 3) * 100)

knitr::kable(Airline_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay, 
    "%")))
Airline Percent_Delay
ALASKA 13.3 %
AMWEST 10.9 %

You can see a simple bar plot showing the percent of delays for each airline in the box below.

Comparison Code & Tables: City (all airlines)


To show the difference in delays between cities, including all airlines, I repeated the proceedure to the right, used to sort and compare airlines, this time grouping by City.

The Computations and tables were made using the same process used with Airlines and saved into City_Comparisons. Again the formual for each city was numerator (delays) and the denominator (delays + on-time flights). They were again rounded to 3 places and multiplied by 100 to reflect a true percent.

City_Comparisons <- new_data %>% group_by(City) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) + 
    sum(On_Time))), 3) * 100)

knitr::kable(City_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay, 
    "%")))
City Percent_Delay
Los Angeles 13.1 %
Phoenix 7.8 %
San Diego 12.5 %
San Francisco 21.9 %
Seattle 15.2 %

You can see a simple bar plot showing the percent of delays in each city in the box below.

row

Plot Airline Comparison

Plot City Comparison

row

Analysis


In looking at the tables and charts for the two airlines, you can see that their delays were pretty similar around 11% and 14%. However, by plotting the delays by city, it is clear that some cities have very few delays (Phoenix around 8%) and others are late more than 20% of the time (San Francisco).

Based on the city information, and knowing that the real file has many more cities and airlines in it, a meaningful analysis should look at how the numbers of flights into heavily delayed cities impact and airlines delay rate, or if those cities appear to be more delayed because less timely airlines represent a higher proportion of their daily arrivals.

row

Conclusion


Although the data were simple, really just a 4x7 array, in order to easily use it, it was necessary to properly import, sort, modify and then tidy the set to be able to easily summarize the data included in it. The data definitely needed to be in long format.

Much of what was done with dplyr and tidyr could easily have been done with base R functions and other packages, using the tidy method made it possible t build upon intermediate steps in the data structuring process without editing row by row or column by column, or writing functions and using apply.

By carefully constructing the importation and data structuring process for this sample, it is highly likely that we could apply this script to the entire set of data across all airlines and cities, provided they followed the same conventions, which is what makes sorting data the Tidy Way so valuable.