Week Five- Tidy Data

row

Importing CSV & Basic Cleaning

I loaded the required libraries, dplyr and tidyr as well as flexdashboard and htmltools to, in the setup chunk which is set not to echo in this dashboard. These last two packages allowed me to format the html document, and its comments in this stratified rMarkdown format.

Here I load the date directly from my Github repository, which is public and setting the na.strings=c("") argument so that the blank cells beneath each airline, which we are to assume is the previous airline, is filled with NA. This will make filling them with a value easier later on in the process.

After loading the table, names were given the the first two columns, which had not headers in the csv by expressly assigning names to only the first two columns by selecting them with data[1:2].

Then I used apply() and gsub() to replace the “,” in the numbers (which are not character fields) with nothing, effectively removing them from the text. You can see the resulting data in the table below. We have uniformly formatted text and column names.

url <- "https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/airlinesRaw.csv"
raw_data <- read.csv(url, header = TRUE, na.strings = c(""))  #NA added to facilitate fill later
names(raw_data)[1:2] <- c("Airline", "Arrival")  #naming first two columns 
raw_data <- data.frame(apply(raw_data, 2, function(y) (gsub(",", "", y))))  #removing ',' 
knitr::kable(raw_data)

Airline	Arrival	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
AMWEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

row

Using Tidyr & Dplyr to Manipulate Data Frame

Starting with the table above, all the fields are still character data, with those two NA values in the Airline and Arrival.

I create a new frame, called new_data and using pipes, dplyr, and tidyr do the following:

Fill the two empty airline cells with the PREVIOUS (the default for fill()) value for Airline
Recode the Arrival to Delayed and On_Time which will both meet the convention of capitalizing a variable, which I am following and the R requirement of no spaces (planning ahead)
Create columns City from the current variable names and Flghts which will be filled with the values beneath the current column’s names by gathering in the wide data
Mutate is applied to the single column Flights to create numeric values for this data
spread is used to split out On_Time column and Delayed columns as they are separate variables
City is now converted to a factor
City levels are recoded using tidyr’s recode to remove the dots in the names, (not necessary but leads to uniformity accross data tables)

You can see from the following table, that the data is long, uniform and well groomed.

new_data <- raw_data %>% fill(Airline) %>% mutate(Arrival = recode(Arrival, 
    delayed = "Delayed", `on time` = "On_Time")) %>% gather(City, Flights, Los.Angeles:Seattle) %>% 
    mutate(Flights = as.numeric(Flights)) %>% spread(Arrival, Flights) %>% mutate(City = as.factor(City)) %>% 
    mutate(City = recode(City, Los.Angeles = "Los Angeles", San.Diego = "San Diego", 
        San.Francisco = "San Francisco"))
knitr::kable(new_data)

Airline	City	Delayed	On_Time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AMWEST	Los Angeles	117	694
AMWEST	Phoenix	415	4840
AMWEST	San Diego	65	383
AMWEST	San Francisco	129	320
AMWEST	Seattle	61	201

row

Comparison Code & Tables: Airline (all cities)

With major grooming done, it is a simple matter of creating percents of flights for each airline which were delayed. The formula for each airline is:\[ delays/all-flights\] The following steps were followed to calculate the percents, using dplyr and standard mathematical functions

Create a data frame called Airline_Comparison and set it equal to new_data
Which we group by Airline
Create a Percent_Delay column using summarize() and the above formula, applying sum() to each column so as to take the total for all group_by() categories of Airline for the numerator (delays) and the denominator (delays + on-time flights), rounding to 3 places and multiplying by 100 to get the percent of each airline delayed

in printing the table, I pasted the % symbol in to make the results unmistakable, but in the data set, they are still numeric data

Airline_Comparisons <- new_data %>% group_by(Airline) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) + 
    sum(On_Time))), 3) * 100)

knitr::kable(Airline_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay, 
    "%")))

Airline	Percent_Delay
ALASKA	13.3 %
AMWEST	10.9 %

You can see a simple bar plot showing the percent of delays for each airline in the box below.

Comparison Code & Tables: City (all airlines)

To show the difference in delays between cities, including all airlines, I repeated the proceedure to the right, used to sort and compare airlines, this time grouping by City.

The Computations and tables were made using the same process used with Airlines and saved into City_Comparisons. Again the formual for each city was numerator (delays) and the denominator (delays + on-time flights). They were again rounded to 3 places and multiplied by 100 to reflect a true percent.

City_Comparisons <- new_data %>% group_by(City) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) + 
    sum(On_Time))), 3) * 100)

knitr::kable(City_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay, 
    "%")))

City	Percent_Delay
Los Angeles	13.1 %
Phoenix	7.8 %
San Diego	12.5 %
San Francisco	21.9 %
Seattle	15.2 %

You can see a simple bar plot showing the percent of delays in each city in the box below.

row

Plot Airline Comparison

Plot City Comparison

row

Analysis

In looking at the tables and charts for the two airlines, you can see that their delays were pretty similar around 11% and 14%. However, by plotting the delays by city, it is clear that some cities have very few delays (Phoenix around 8%) and others are late more than 20% of the time (San Francisco).

Based on the city information, and knowing that the real file has many more cities and airlines in it, a meaningful analysis should look at how the numbers of flights into heavily delayed cities impact and airlines delay rate, or if those cities appear to be more delayed because less timely airlines represent a higher proportion of their daily arrivals.

row

Conclusion

Although the data were simple, really just a 4x7 array, in order to easily use it, it was necessary to properly import, sort, modify and then tidy the set to be able to easily summarize the data included in it. The data definitely needed to be in long format.

Much of what was done with dplyr and tidyr could easily have been done with base R functions and other packages, using the tidy method made it possible t build upon intermediate steps in the data structuring process without editing row by row or column by column, or writing functions and using apply.

By carefully constructing the importation and data structuring process for this sample, it is highly likely that we could apply this script to the entire set of data across all airlines and cities, provided they followed the same conventions, which is what makes sorting data the Tidy Way so valuable.