I loaded the required libraries, dplyr and tidyr as well as flexdashboard and htmltools to, in the setup chunk which is set not to echo in this dashboard. These last two packages allowed me to format the html document, and its comments in this stratified rMarkdown format.
Here I load the date directly from my Github repository, which is public and setting the na.strings=c("") argument so that the blank cells beneath each airline, which we are to assume is the previous airline, is filled with NA. This will make filling them with a value easier later on in the process.
After loading the table, names were given the the first two columns, which had not headers in the csv by expressly assigning names to only the first two columns by selecting them with data[1:2].
Then I used apply() and gsub() to replace the “,” in the numbers (which are not character fields) with nothing, effectively removing them from the text. You can see the resulting data in the table below. We have uniformly formatted text and column names.
url <- "https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/airlinesRaw.csv"
raw_data <- read.csv(url, header = TRUE, na.strings = c("")) #NA added to facilitate fill later
names(raw_data)[1:2] <- c("Airline", "Arrival") #naming first two columns
raw_data <- data.frame(apply(raw_data, 2, function(y) (gsub(",", "", y)))) #removing ','
knitr::kable(raw_data)| Airline | Arrival | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
Starting with the table above, all the fields are still character data, with those two NA values in the Airline and Arrival.
I create a new frame, called new_data and using pipes, dplyr, and tidyr do the following:
fill()) value for AirlineArrival to Delayed and On_Time which will both meet the convention of capitalizing a variable, which I am following and the R requirement of no spaces (planning ahead)City from the current variable names and Flghts which will be filled with the values beneath the current column’s names by gathering in the wide dataFlights to create numeric values for this dataspread is used to split out On_Time column and Delayed columns as they are separate variableslevels are recoded using tidyr’s recode to remove the dots in the names, (not necessary but leads to uniformity accross data tables)You can see from the following table, that the data is long, uniform and well groomed.
new_data <- raw_data %>% fill(Airline) %>% mutate(Arrival = recode(Arrival,
delayed = "Delayed", `on time` = "On_Time")) %>% gather(City, Flights, Los.Angeles:Seattle) %>%
mutate(Flights = as.numeric(Flights)) %>% spread(Arrival, Flights) %>% mutate(City = as.factor(City)) %>%
mutate(City = recode(City, Los.Angeles = "Los Angeles", San.Diego = "San Diego",
San.Francisco = "San Francisco"))
knitr::kable(new_data)| Airline | City | Delayed | On_Time |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AMWEST | Los Angeles | 117 | 694 |
| AMWEST | Phoenix | 415 | 4840 |
| AMWEST | San Diego | 65 | 383 |
| AMWEST | San Francisco | 129 | 320 |
| AMWEST | Seattle | 61 | 201 |
With major grooming done, it is a simple matter of creating percents of flights for each airline which were delayed. The formula for each airline is:\[ delays/all-flights\] The following steps were followed to calculate the percents, using dplyr and standard mathematical functions
Airline_Comparison and set it equal to new_dataAirlinePercent_Delay column using summarize() and the above formula, applying sum() to each column so as to take the total for all group_by() categories of Airline for the numerator (delays) and the denominator (delays + on-time flights), rounding to 3 places and multiplying by 100 to get the percent of each airline delayed in printing the table, I pasted the % symbol in to make the results unmistakable, but in the data set, they are still numeric data
Airline_Comparisons <- new_data %>% group_by(Airline) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) +
sum(On_Time))), 3) * 100)
knitr::kable(Airline_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay,
"%")))| Airline | Percent_Delay |
|---|---|
| ALASKA | 13.3 % |
| AMWEST | 10.9 % |
You can see a simple bar plot showing the percent of delays for each airline in the box below.
To show the difference in delays between cities, including all airlines, I repeated the proceedure to the right, used to sort and compare airlines, this time grouping by City.
The Computations and tables were made using the same process used with Airlines and saved into City_Comparisons. Again the formual for each city was numerator (delays) and the denominator (delays + on-time flights). They were again rounded to 3 places and multiplied by 100 to reflect a true percent.
City_Comparisons <- new_data %>% group_by(City) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) +
sum(On_Time))), 3) * 100)
knitr::kable(City_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay,
"%")))| City | Percent_Delay |
|---|---|
| Los Angeles | 13.1 % |
| Phoenix | 7.8 % |
| San Diego | 12.5 % |
| San Francisco | 21.9 % |
| Seattle | 15.2 % |
You can see a simple bar plot showing the percent of delays in each city in the box below.
In looking at the tables and charts for the two airlines, you can see that their delays were pretty similar around 11% and 14%. However, by plotting the delays by city, it is clear that some cities have very few delays (Phoenix around 8%) and others are late more than 20% of the time (San Francisco).
Based on the city information, and knowing that the real file has many more cities and airlines in it, a meaningful analysis should look at how the numbers of flights into heavily delayed cities impact and airlines delay rate, or if those cities appear to be more delayed because less timely airlines represent a higher proportion of their daily arrivals.
Although the data were simple, really just a 4x7 array, in order to easily use it, it was necessary to properly import, sort, modify and then tidy the set to be able to easily summarize the data included in it. The data definitely needed to be in long format.
Much of what was done with dplyr and tidyr could easily have been done with base R functions and other packages, using the tidy method made it possible t build upon intermediate steps in the data structuring process without editing row by row or column by column, or writing functions and using apply.
By carefully constructing the importation and data structuring process for this sample, it is highly likely that we could apply this script to the entire set of data across all airlines and cities, provided they followed the same conventions, which is what makes sorting data the Tidy Way so valuable.