The first step is to get the CSV we created into R. First, we create a variable for the file name, aptly called fileName, for storing the string that is the path to the CSV we need to use. After that, it’s a matter of using the read.csv() function.
fileName <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Week05Assignment/arrivals.csv"
arrivals <- read.csv(fileName)
knitr::kable(arrivals)
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| NA | NA | NA | NA | NA | ||
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
Tibbles are beefed up, dumbed down, friendlier data frames. By using the as_tibble() function we can make our imported CSV into a tibble, similar to how we can use as.data.frame() in base R. We then use drop_na() to remove any rows that have NA. This might not be ideal in another dataset, but for this one, it will do since we have a separator that does nothing for us.
arrivals <- as_tibble(arrivals)
arrivals <- drop_na(arrivals)
arrivals
## # A tibble: 4 x 7
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## <fct> <fct> <int> <int> <int> <int> <int>
## 1 ALASKA on time 497 221 212 503 1841
## 2 "" delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 "" delayed 117 415 65 129 61
Something that is going to make dealing with this Tibble difficult is the empty strings we see, so let’s just assign them accordingly.
arrivals[2,1] <- arrivals[1,1]
arrivals[4,1] <- arrivals[3,1]
arrivals
## # A tibble: 4 x 7
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## <fct> <fct> <int> <int> <int> <int> <int>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
The first thing we’re going to want to do is combine the airlines with the status of the flights. This is for the sake of making columns later as it will make things easier for us. To do this, we use the unite() function. We are required to provide it the name of the column, the array or list of columns we are looking to combine. In this case, we are looking to combine what is currently column X, which has the values “ALASKA” and “AM WEST”, and the column with the status of the flights, which has the values “delayed” and “on time”. If no seperator is provided, the default is an underscore. For this reason, we are going to provide the separator as " - " for clarity’s sake.
arrivals.explanation <- unite(arrivals, Airline.Status, X:X.1, sep=" - ")
## # A tibble: 4 x 6
## Airline.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## * <chr> <int> <int> <int> <int> <int>
## 1 ALASKA - on time 497 221 212 503 1841
## 2 ALASKA - delayed 62 12 20 102 305
## 3 AM WEST - on time 694 4840 383 320 201
## 4 AM WEST - delayed 117 415 65 129 61
Next, we want to get all of the cities into their own rows and to put the amount of arrivals, delayed or on time, in a column. We’ll be using the gather() function for this, providing the tibble, the name of the first new column which will act as a key, the name of the second new column which will act as values, and the columns being used to make these new columns. In this instance, we want the cities to all be under a column named “city” and the amount of arrivals to all be under a column named “arrivalAmt”. We also want to populate the new city column with the names of the cities that are currently column names, and populate the new arrivalAmt column with the values found in each of the columns bearing city names.
arrivals.explanation <- gather(arrivals.explanation, city, arrivalAmt, Los.Angeles:Seattle)
## # A tibble: 20 x 3
## Airline.Status city arrivalAmt
## <chr> <chr> <int>
## 1 ALASKA - on time Los.Angeles 497
## 2 ALASKA - delayed Los.Angeles 62
## 3 AM WEST - on time Los.Angeles 694
## 4 AM WEST - delayed Los.Angeles 117
## 5 ALASKA - on time Phoenix 221
## 6 ALASKA - delayed Phoenix 12
## 7 AM WEST - on time Phoenix 4840
## 8 AM WEST - delayed Phoenix 415
## 9 ALASKA - on time San.Diego 212
## 10 ALASKA - delayed San.Diego 20
## 11 AM WEST - on time San.Diego 383
## 12 AM WEST - delayed San.Diego 65
## 13 ALASKA - on time San.Francisco 503
## 14 ALASKA - delayed San.Francisco 102
## 15 AM WEST - on time San.Francisco 320
## 16 AM WEST - delayed San.Francisco 129
## 17 ALASKA - on time Seattle 1841
## 18 ALASKA - delayed Seattle 305
## 19 AM WEST - on time Seattle 201
## 20 AM WEST - delayed Seattle 61
Lastly we’ll want to take the airline status column and make it into multiple columns, as our observations were of the airports in each city. To do this, we want the spread() function. We’ll have to again specify the tibble we’re using, and then we’ll want to provide the column that currently has the values we want to be individual columns, and then the column that we want to use the populate those new columns. We’re looking to have “ALASKA - on time”, “ALASKA - delayed”, “AM WEST - on time”, and “AM WEST - delayed” as columns, which then have the numbers in the column “arrivalAmt”.
arrivals.explanation <- spread(arrivals.explanation, Airline.Status, arrivalAmt)
## # A tibble: 5 x 5
## city `ALASKA - delayed` `ALASKA - on time` `AM WEST - delayed`
## * <chr> <int> <int> <int>
## 1 Los.Angeles 62 497 117
## 2 Phoenix 12 221 415
## 3 San.Diego 20 212 65
## 4 San.Francisco 102 503 129
## 5 Seattle 305 1841 61
## # ... with 1 more variable: `AM WEST - on time` <int>
Not sure if that’s a word but if it wasn’t before, now it is. All of the things I explained above can be chained into a block using the %>% operator. When we use this, we don’t need to reiterate the name of the tibble each time unless we’re using a new tibble.
arrivals <- arrivals %>%
unite(Airline.Status, X:X.1, sep=" - ") %>%
gather(city, arrivalAmt, Los.Angeles:Seattle) %>%
spread(Airline.Status, arrivalAmt)
## # A tibble: 5 x 5
## city `ALASKA - delayed` `ALASKA - on time` `AM WEST - delayed`
## * <chr> <int> <int> <int>
## 1 Los.Angeles 62 497 117
## 2 Phoenix 12 221 415
## 3 San.Diego 20 212 65
## 4 San.Francisco 102 503 129
## 5 Seattle 305 1841 61
## # ... with 1 more variable: `AM WEST - on time` <int>
Note that it looks the exact same as the one from the previous section!
And just to make it look nicer, we’re going to replace all instances of periods in the city names with spaces using gsub(). This is a function we’ve already learned, but just for a refresher, it takes a pattern, the replacement for all matches of that pattern, and a character vector or other object that can be coerced to be a character vector. In this instance, our pattern will be “\.”, the period with leading double backslashes to escape the character so it will not be the regular expression “.”, which making a single instance of anything.
arrivals[["city"]] <- gsub("\\.", " ", arrivals[["city"]])
The final product is thus:
| city | ALASKA - delayed | ALASKA - on time | AM WEST - delayed | AM WEST - on time |
|---|---|---|---|---|
| Los Angeles | 62 | 497 | 117 | 694 |
| Phoenix | 12 | 221 | 415 | 4840 |
| San Diego | 20 | 212 | 65 | 383 |
| San Francisco | 102 | 503 | 129 | 320 |
| Seattle | 305 | 1841 | 61 | 201 |
At first, comparing the delays as they are seems the wisest thing to do, until you realize the sheer difference in total flights between ALASKA and AM WEST.
barplot(as.matrix(select(arrivals, contains("delayed"))),
beside=T, ylim=c(0,500), xlab="Airlines and Cities",
ylab="Total Delayed Flights",
main="Total Delayed Flights by Airline and City",
legend.text=arrivals[[1]], args.legend=list(x="topright"))
A better comparison would be to compare them as percentages of total flights. To do this, we have to make another tibble and modify the values in it.
First we’re going to make a new tibble named “arrivals.perc” and select out from the arrivals tibble the city column and all columns with the word delay. From there, we are going to change the values for the delayed columns in arrivals.perc to be numbers between 0 and 100. To do this, we need to take the number present in these rows already, and then divide those by the number provided by a rowSums() function. In the rowSums function we will be adding the values under all columns bearing the name “ALASKA”, and then “AM WEST”, and applying it to the proper delayed column.
arrivals.perc <- select(arrivals, c(city,contains("delay")))
arrivals.perc[,"ALASKA - delayed"] <- (arrivals.perc[,"ALASKA - delayed"]/rowSums(select(arrivals, contains("ALASKA"))))*100
arrivals.perc[,"AM WEST - delayed"] <- (arrivals.perc[,"AM WEST - delayed"]/rowSums(select(arrivals, contains("AM WEST"))))*100
## # A tibble: 5 x 3
## city `ALASKA - delayed` `AM WEST - delayed`
## * <chr> <dbl> <dbl>
## 1 Los Angeles 11.1 14.4
## 2 Phoenix 5.15 7.90
## 3 San Diego 8.62 14.5
## 4 San Francisco 16.9 28.7
## 5 Seattle 14.2 23.3
The resulting bar graph tells a much more interesting story.
barplot(as.matrix(arrivals.perc[,2:3]), beside=T,
ylim=c(0,100), xlab="Airlines and Cities",
ylab="Percentage of Flights Delayed",
main="Delayed Flights as Percent Of Total Flights by Airline and City",
legend.text=arrivals.perc[[1]], args.legend=list(x="topright"))
Even when accounting for the larger number of flights, AM WEST still has significantly more delayed flights. The previous chart made it seem as if Phoenix had been the city receiving most delays, when in truth, it was actually San Francisco, followed closely by Seattle. Looking at both charts, we can see a similar trend in the cities that cause the most issues and the least in general for each airline, but with AM WEST, the delays are amplified. If it were simply a matter of the airports in these cities being consistently troublesome, the percentages would be similar.
sum(select(arrivals, contains("WEST")))/sum(select(arrivals, contains("ALASKA")))
## [1] 1.913907
AM WEST has almost twice the number of flights as ALASKA, so we’d expect there to be more delays overall, but the percentages should be near identical for total delays. Because it is not, we can come to the conclusion safely that this is an issue on AM WEST’s behalf. We can also safely come to the conclusion that if you are waiting for a plane from AM WEST or ALASKA in San Francisco, there’s about a 16 to 30% chance your flight will be delayed. If you want to reduce the chance of your flight being delayed and have to pick one of these two airlines, your best bet according to this data will consistently be ALASKA.
While the above conclusion is sound, having information regarding the length of delays would make it easier to discern which airline might be the better pick for each city’s airport, the better airline to take overall, and the average time delayed for each city’s airport.