I loaded the required libraries, dplyr
and tidyr
as well as flexdashboard
and htmltools
to, in the setup chunk which is set not to echo in this dashboard. These last two packages allowed me to format the html document, and its comments in this stratified rMarkdown format.
Here I load the date directly from my Github repository, which is public and setting the na.strings=c("")
argument so that the blank cells beneath each airline, which we are to assume is the previous airline, is filled with NA. This will make filling them with a value easier later on in the process.
After loading the table, names were given the the first two columns, which had not headers in the csv by expressly assigning names
to only the first two columns by selecting them with data[1:2]
.
Then I used apply()
and gsub()
to replace the “,” in the numbers (which are not character fields) with nothing, effectively removing them from the text. You can see the resulting data in the table below. We have uniformly formatted text and column names.
url <- "https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/airlinesRaw.csv"
raw_data <- read.csv(url, header = TRUE, na.strings = c("")) #NA added to facilitate fill later
names(raw_data)[1:2] <- c("Airline", "Arrival") #naming first two columns
raw_data <- data.frame(apply(raw_data, 2, function(y) (gsub(",", "", y)))) #removing ','
knitr::kable(raw_data)
Airline | Arrival | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
Starting with the table above, all the fields are still character data, with those two NA
values in the Airline and Arrival.
I create a new frame, called new_data
and using pipes
, dplyr
, and tidyr
do the following:
fill()
) value for Airline
Arrival
to Delayed
and On_Time
which will both meet the convention of capitalizing a variable, which I am following and the R requirement of no spaces (planning ahead)City
from the current variable names and Flghts
which will be filled with the values beneath the current column’s names by gathering in the wide dataFlights
to create numeric values for this dataspread
is used to split out On_Time
column and Delayed
columns as they are separate variableslevels
are recoded using tidyr’s recode
to remove the dots in the names, (not necessary but leads to uniformity accross data tables)You can see from the following table, that the data is long, uniform and well groomed.
new_data <- raw_data %>% fill(Airline) %>% mutate(Arrival = recode(Arrival,
delayed = "Delayed", `on time` = "On_Time")) %>% gather(City, Flights, Los.Angeles:Seattle) %>%
mutate(Flights = as.numeric(Flights)) %>% spread(Arrival, Flights) %>% mutate(City = as.factor(City)) %>%
mutate(City = recode(City, Los.Angeles = "Los Angeles", San.Diego = "San Diego",
San.Francisco = "San Francisco"))
knitr::kable(new_data)
Airline | City | Delayed | On_Time |
---|---|---|---|
ALASKA | Los Angeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | San Diego | 20 | 212 |
ALASKA | San Francisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AMWEST | Los Angeles | 117 | 694 |
AMWEST | Phoenix | 415 | 4840 |
AMWEST | San Diego | 65 | 383 |
AMWEST | San Francisco | 129 | 320 |
AMWEST | Seattle | 61 | 201 |
With major grooming done, it is a simple matter of creating percents of flights for each airline which were delayed. The formula for each airline is:\[ delays/all-flights\] The following steps were followed to calculate the percents, using dplyr and standard mathematical functions
Airline_Comparison
and set it equal to new_dataAirline
Percent_Delay
column using summarize()
and the above formula, applying sum()
to each column so as to take the total for all group_by()
categories of Airline
for the numerator (delays) and the denominator (delays + on-time flights), rounding to 3 places and multiplying by 100 to get the percent of each airline delayed in printing the table, I pasted the %
symbol in to make the results unmistakable, but in the data set, they are still numeric data
Airline_Comparisons <- new_data %>% group_by(Airline) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) +
sum(On_Time))), 3) * 100)
knitr::kable(Airline_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay,
"%")))
Airline | Percent_Delay |
---|---|
ALASKA | 13.3 % |
AMWEST | 10.9 % |
You can see a simple bar plot showing the percent of delays for each airline in the box below.
To show the difference in delays between cities, including all airlines, I repeated the proceedure to the right, used to sort and compare airlines, this time grouping by City
.
The Computations and tables were made using the same process used with Airlines
and saved into City_Comparisons
. Again the formual for each city was numerator (delays) and the denominator (delays + on-time flights). They were again rounded to 3 places and multiplied by 100 to reflect a true percent.
City_Comparisons <- new_data %>% group_by(City) %>% summarize(Percent_Delay = round((sum(Delayed)/(sum(Delayed) +
sum(On_Time))), 3) * 100)
knitr::kable(City_Comparisons %>% mutate(Percent_Delay = paste(Percent_Delay,
"%")))
City | Percent_Delay |
---|---|
Los Angeles | 13.1 % |
Phoenix | 7.8 % |
San Diego | 12.5 % |
San Francisco | 21.9 % |
Seattle | 15.2 % |
You can see a simple bar plot showing the percent of delays in each city in the box below.
In looking at the tables and charts for the two airlines, you can see that their delays were pretty similar around 11% and 14%. However, by plotting the delays by city, it is clear that some cities have very few delays (Phoenix around 8%) and others are late more than 20% of the time (San Francisco).
Based on the city information, and knowing that the real file has many more cities and airlines in it, a meaningful analysis should look at how the numbers of flights into heavily delayed cities impact and airlines delay rate, or if those cities appear to be more delayed because less timely airlines represent a higher proportion of their daily arrivals.
Although the data were simple, really just a 4x7 array, in order to easily use it, it was necessary to properly import, sort, modify and then tidy the set to be able to easily summarize the data included in it. The data definitely needed to be in long format.
Much of what was done with dplyr
and tidyr
could easily have been done with base R functions and other packages, using the tidy method made it possible t build upon intermediate steps in the data structuring process without editing row by row or column by column, or writing functions and using apply.
By carefully constructing the importation and data structuring process for this sample, it is highly likely that we could apply this script to the entire set of data across all airlines and cities, provided they followed the same conventions, which is what makes sorting data the Tidy Way so valuable.