Inspired by this blog post (https://www.r-bloggers.com/how-to-reshape-data-in-r-tidyr-vs-reshape2/), where appropriate I will show how to perform the same operation using two different functions, either reshape2 or base R versus the tidyverse tidyr/dplyr.
My main experience is using base R, with a bit of reshape2 for simple operations like switching between wide and long. Here I typically first looked into how to perform the function using my existing knowledge, before also searching for the equivalent in the tidyverse.
Load all libraries we will need.
library(stringr)
library(reshape2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
Start by reading in the CSV file.
This CSV file was created by putting the information into Excel exactly as it looked in the description, then converting this sheet to CSV.
airlines <- read.csv("https://raw.githubusercontent.com/heathergeiger/Tidying_airlines_data/master/alaska_vs_am_west_airlines.csv",check.names=FALSE,stringsAsFactors=FALSE)
airlines
## Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
Remove empty lines and make column headers more descriptive.
airlines <- airlines[which(airlines[,2] == "on time" | airlines[,2] == "delayed"),]
colnames(airlines)[1:2] <- c("Airline","Status")
airlines
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
Right now some columns are going to be character that should be numeric, because numbers greater than 1,000 are written with a comma.
Remove the commas using str_replace from the stringr package.
Then apply as.numeric function to convert all the city columns to numeric.
for(i in 3:ncol(airlines))
{
airlines[,i] <- str_replace(airlines[,i],pattern=",",replacement="")
airlines[,i] <- as.numeric(airlines[,i])
}
airlines
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Check the sum of one of the columns that formerly had a comma in number to be sure this worked.
sum(airlines[,4])
## [1] 5488
221 + 12 + 4840 + 415
## [1] 5488
Looks like removing the commas, then applying as.numeric, worked well.
Before we can convert to long format, we need to repeat the value in Airline column that is there for on time, immediately after for delayed.
airlines$Airline <- rep(airlines$Airline[which(airlines$Airline != "")],each=2)
airlines
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Now ready to convert to long format.
One way to convert from wide to long format is using function melt from the reshape2 package.
Another is using the gather function.
melt_result <- melt(airlines,variable.name="City",value.name="Flights",id.vars=c("Airline","Status"))
melt_result
## Airline Status City Flights
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 Alaska on time San Diego 212
## 10 Alaska delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 Alaska on time San Francisco 503
## 14 Alaska delayed San Francisco 102
## 15 AM WEST on time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 Alaska on time Seattle 1841
## 18 Alaska delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
gather_result <- gather(airlines,City,Flights,-Airline,-Status)
gather_result
## Airline Status City Flights
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 Alaska on time San Diego 212
## 10 Alaska delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 Alaska on time San Francisco 503
## 14 Alaska delayed San Francisco 102
## 15 AM WEST on time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 Alaska on time Seattle 1841
## 18 Alaska delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
These are identical. Since this assignment is supposed to be based on tidyverse, let’s use the gather result.
airlines <- gather_result
This is mostly what we want, except we really want two columns for on time vs. delayed for each instance of airline + city.
For this, we could use either the spread function from tidyr or the dcast function from reshape2.
One we show that they make the same result, reassign airlines to the result of spread.
dcast_result <- dcast(airlines,formula = Airline + City ~ Status,value.var="Flights")
dcast_result
## Airline City delayed on time
## 1 Alaska Los Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San Diego 20 212
## 4 Alaska San Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
spread_result <- spread(airlines,Status,Flights)
spread_result
## Airline City delayed on time
## 1 Alaska Los Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San Diego 20 212
## 4 Alaska San Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
airlines <- spread_result
One last thing - change “on time” to “on.time” to be more R-friendly.
colnames(airlines)[4] <- "on.time"
Now our data is pretty tidy, but we need to do some transformations to get anything meaningful.
Mainly for each airline in each city, we’ll want to get the rate of delays and the total number of flights.
Add this information to airlines using either transform (base R) or mutate (dplyr).
transform_result <- transform(airlines,total.flights = on.time + delayed,delayed.rate = (delayed*100)/(on.time + delayed))
transform_result
## Airline City delayed on.time total.flights delayed.rate
## 1 Alaska Los Angeles 62 497 559 11.091234
## 2 Alaska Phoenix 12 221 233 5.150215
## 3 Alaska San Diego 20 212 232 8.620690
## 4 Alaska San Francisco 102 503 605 16.859504
## 5 Alaska Seattle 305 1841 2146 14.212488
## 6 AM WEST Los Angeles 117 694 811 14.426634
## 7 AM WEST Phoenix 415 4840 5255 7.897241
## 8 AM WEST San Diego 65 383 448 14.508929
## 9 AM WEST San Francisco 129 320 449 28.730512
## 10 AM WEST Seattle 61 201 262 23.282443
mutate_result <- mutate(airlines,total.flights = on.time + delayed,delayed.rate = (delayed*100)/(on.time + delayed))
mutate_result
## Airline City delayed on.time total.flights delayed.rate
## 1 Alaska Los Angeles 62 497 559 11.091234
## 2 Alaska Phoenix 12 221 233 5.150215
## 3 Alaska San Diego 20 212 232 8.620690
## 4 Alaska San Francisco 102 503 605 16.859504
## 5 Alaska Seattle 305 1841 2146 14.212488
## 6 AM WEST Los Angeles 117 694 811 14.426634
## 7 AM WEST Phoenix 415 4840 5255 7.897241
## 8 AM WEST San Diego 65 383 448 14.508929
## 9 AM WEST San Francisco 129 320 449 28.730512
## 10 AM WEST Seattle 61 201 262 23.282443
In this case, the syntax and result are identical.
However, the tidyverse can also be done via piping, which can be useful when we have more complex operations to perform.
Let’s show this here. Then, reassign airlines to the result of mutate.
mutate_via_pipe_result <- airlines %>% mutate(total.flights = on.time + delayed,delayed.rate = (delayed*100)/(on.time + delayed))
mutate_via_pipe_result
## Airline City delayed on.time total.flights delayed.rate
## 1 Alaska Los Angeles 62 497 559 11.091234
## 2 Alaska Phoenix 12 221 233 5.150215
## 3 Alaska San Diego 20 212 232 8.620690
## 4 Alaska San Francisco 102 503 605 16.859504
## 5 Alaska Seattle 305 1841 2146 14.212488
## 6 AM WEST Los Angeles 117 694 811 14.426634
## 7 AM WEST Phoenix 415 4840 5255 7.897241
## 8 AM WEST San Diego 65 383 448 14.508929
## 9 AM WEST San Francisco 129 320 449 28.730512
## 10 AM WEST Seattle 61 201 262 23.282443
airlines <- mutate_result
Let’s make a side-by-side barplot where we plot the delay rate by city, putting the two airlines side-by-side.
ggplot(airlines,aes(x = City,y = delayed.rate,fill=Airline)) +
geom_bar(stat='identity',position='dodge') +
ylab("Percent of flights delayed") +
theme(axis.title.x=element_blank(),axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Based on this, it seems pretty clear that Alaska airlines performs better, having a lower delay rate in every city.
Just for fun, let’s try to make a similar plot in base R.
We’ll have to convert the data back to wide format first. Then, use barplot function from base R.
delayed_rates_wide <- spread(airlines[,c("Airline","City","delayed.rate")],City,delayed.rate)
rownames(delayed_rates_wide) <- delayed_rates_wide$Airline
delayed_rates_wide <- delayed_rates_wide[,2:ncol(delayed_rates_wide)]
barplot(as.matrix(delayed_rates_wide),beside=T,las=2,ylab="Percent of flights delayed",col=c("red","blue"))
legend("topleft",legend=rownames(delayed_rates_wide),col=c("red","blue"),pch=15,bty="n")
That actually wasn’t so bad. The major issue here though, is that the column names end up not fitting on the page using base R with default settings.
I’m sure there is a way to get them to fit better by fiddling with the margins, but in this case it’s easier to just use ggplot2.
In this case, the conclusion is pretty clear when looking city-by-city.
But what if we didn’t have the city-by-city data, but only aggregate data by airline?
Use the aggregate function from base R or group_by and summarize_all from dplyr to sum delayed and on time per airline.
aggregate_result <- aggregate(.~Airline,FUN=sum,data=airlines[,c("Airline","delayed","on.time")])
aggregate_result
## Airline delayed on.time
## 1 Alaska 501 3274
## 2 AM WEST 787 6438
summarize_result <- airlines[,c("Airline","delayed","on.time")] %>% group_by(Airline) %>% summarize_all(funs(sum))
summarize_result
## # A tibble: 2 x 3
## Airline delayed on.time
## <chr> <dbl> <dbl>
## 1 Alaska 501 3274
## 2 AM WEST 787 6438
Now, get a delayed rate similar to how we did previously.
Let’s get fancy and pipe the result of summarize right into mutate, without using object summarize_result.
airlines[,c("Airline","delayed","on.time")] %>% group_by(Airline) %>% summarize_all(funs(sum)) %>% mutate(delayed.rate = delayed*100/(delayed + on.time))
## # A tibble: 2 x 4
## Airline delayed on.time delayed.rate
## <chr> <dbl> <dbl> <dbl>
## 1 Alaska 501 3274 13.27152
## 2 AM WEST 787 6438 10.89273
Interesting! If we do not separate by city, AM WEST will appear to be better in terms of delay rate.
Let’s make an explanatory plot to show how this might occur.
Make a stacked barplot with the percent of flights coming from each city per airline.
Combined with the plot above, which shows trends of delay rates by city, we may get a better sense of how this happened.
ggplot(airlines,aes(x = Airline,y=total.flights,fill=City)) +
geom_bar(position = "fill",stat = "identity") +
scale_y_continuous(labels = percent_format()) +
ylab("Percent of flights coming from each city") +
scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2")) #Use colorblind vector as suggested here: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
We find that AM WEST has the majority of flights coming from Phoenix. And we saw from the first chart that Phoenix tends to be a lower-delay city regardless of airline.
So if we do not separate by city, AM WEST will appear to have a lower delay rate than Alaska. Whereas if we normalize by city, we come to the correct conclusion that Alaska performs better in terms of delays.