Getting the CSV into R

The first step is to get the CSV we created into R. First, we create a variable for the file name, aptly called fileName, for storing the string that is the path to the CSV we need to use. After that, it’s a matter of using the read.csv() function.

fileName <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Week05Assignment/arrivals.csv"
arrivals <- read.csv(fileName)
knitr::kable(arrivals)

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

Transforming the CSV

Making a Tibble

Tibbles are beefed up, dumbed down, friendlier data frames. By using the as_tibble() function we can make our imported CSV into a tibble, similar to how we can use as.data.frame() in base R. We then use drop_na() to remove any rows that have NA. This might not be ideal in another dataset, but for this one, it will do since we have a separator that does nothing for us.

arrivals <- as_tibble(arrivals)
arrivals <- drop_na(arrivals)
arrivals

## # A tibble: 4 x 7
##   X       X.1     Los.Angeles Phoenix San.Diego San.Francisco Seattle
##   <fct>   <fct>         <int>   <int>     <int>         <int>   <int>
## 1 ALASKA  on time         497     221       212           503    1841
## 2 ""      delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 ""      delayed         117     415        65           129      61

Fixing the Tibble As Is

Something that is going to make dealing with this Tibble difficult is the empty strings we see, so let’s just assign them accordingly.

arrivals[2,1] <- arrivals[1,1]
arrivals[4,1] <- arrivals[3,1]
arrivals

## # A tibble: 4 x 7
##   X       X.1     Los.Angeles Phoenix San.Diego San.Francisco Seattle
##   <fct>   <fct>         <int>   <int>     <int>         <int>   <int>
## 1 ALASKA  on time         497     221       212           503    1841
## 2 ALASKA  delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Using the Tibble

The first thing we’re going to want to do is combine the airlines with the status of the flights. This is for the sake of making columns later as it will make things easier for us. To do this, we use the unite() function. We are required to provide it the name of the column, the array or list of columns we are looking to combine. In this case, we are looking to combine what is currently column X, which has the values “ALASKA” and “AM WEST”, and the column with the status of the flights, which has the values “delayed” and “on time”. If no seperator is provided, the default is an underscore. For this reason, we are going to provide the separator as " - " for clarity’s sake.

arrivals.explanation <- unite(arrivals, Airline.Status, X:X.1, sep=" - ")

## # A tibble: 4 x 6
##   Airline.Status    Los.Angeles Phoenix San.Diego San.Francisco Seattle
## * <chr>                   <int>   <int>     <int>         <int>   <int>
## 1 ALASKA - on time          497     221       212           503    1841
## 2 ALASKA - delayed           62      12        20           102     305
## 3 AM WEST - on time         694    4840       383           320     201
## 4 AM WEST - delayed         117     415        65           129      61

Next, we want to get all of the cities into their own rows and to put the amount of arrivals, delayed or on time, in a column. We’ll be using the gather() function for this, providing the tibble, the name of the first new column which will act as a key, the name of the second new column which will act as values, and the columns being used to make these new columns. In this instance, we want the cities to all be under a column named “city” and the amount of arrivals to all be under a column named “arrivalAmt”. We also want to populate the new city column with the names of the cities that are currently column names, and populate the new arrivalAmt column with the values found in each of the columns bearing city names.

arrivals.explanation <- gather(arrivals.explanation, city, arrivalAmt, Los.Angeles:Seattle)

## # A tibble: 20 x 3
##    Airline.Status    city          arrivalAmt
##    <chr>             <chr>              <int>
##  1 ALASKA - on time  Los.Angeles          497
##  2 ALASKA - delayed  Los.Angeles           62
##  3 AM WEST - on time Los.Angeles          694
##  4 AM WEST - delayed Los.Angeles          117
##  5 ALASKA - on time  Phoenix              221
##  6 ALASKA - delayed  Phoenix               12
##  7 AM WEST - on time Phoenix             4840
##  8 AM WEST - delayed Phoenix              415
##  9 ALASKA - on time  San.Diego            212
## 10 ALASKA - delayed  San.Diego             20
## 11 AM WEST - on time San.Diego            383
## 12 AM WEST - delayed San.Diego             65
## 13 ALASKA - on time  San.Francisco        503
## 14 ALASKA - delayed  San.Francisco        102
## 15 AM WEST - on time San.Francisco        320
## 16 AM WEST - delayed San.Francisco        129
## 17 ALASKA - on time  Seattle             1841
## 18 ALASKA - delayed  Seattle              305
## 19 AM WEST - on time Seattle              201
## 20 AM WEST - delayed Seattle               61

Lastly we’ll want to take the airline status column and make it into multiple columns, as our observations were of the airports in each city. To do this, we want the spread() function. We’ll have to again specify the tibble we’re using, and then we’ll want to provide the column that currently has the values we want to be individual columns, and then the column that we want to use the populate those new columns. We’re looking to have “ALASKA - on time”, “ALASKA - delayed”, “AM WEST - on time”, and “AM WEST - delayed” as columns, which then have the numbers in the column “arrivalAmt”.

arrivals.explanation <- spread(arrivals.explanation, Airline.Status, arrivalAmt)

## # A tibble: 5 x 5
##   city          `ALASKA - delayed` `ALASKA - on time` `AM WEST - delayed`
## * <chr>                      <int>              <int>               <int>
## 1 Los.Angeles                   62                497                 117
## 2 Phoenix                       12                221                 415
## 3 San.Diego                     20                212                  65
## 4 San.Francisco                102                503                 129
## 5 Seattle                      305               1841                  61
## # ... with 1 more variable: `AM WEST - on time` <int>

Shortcut for Tibbling

Not sure if that’s a word but if it wasn’t before, now it is. All of the things I explained above can be chained into a block using the %>% operator. When we use this, we don’t need to reiterate the name of the tibble each time unless we’re using a new tibble.

arrivals <- arrivals %>%
  unite(Airline.Status, X:X.1, sep=" - ") %>%
  gather(city, arrivalAmt, Los.Angeles:Seattle) %>%
  spread(Airline.Status, arrivalAmt)

## # A tibble: 5 x 5
##   city          `ALASKA - delayed` `ALASKA - on time` `AM WEST - delayed`
## * <chr>                      <int>              <int>               <int>
## 1 Los.Angeles                   62                497                 117
## 2 Phoenix                       12                221                 415
## 3 San.Diego                     20                212                  65
## 4 San.Francisco                102                503                 129
## 5 Seattle                      305               1841                  61
## # ... with 1 more variable: `AM WEST - on time` <int>

Note that it looks the exact same as the one from the previous section!

And just to make it look nicer, we’re going to replace all instances of periods in the city names with spaces using gsub(). This is a function we’ve already learned, but just for a refresher, it takes a pattern, the replacement for all matches of that pattern, and a character vector or other object that can be coerced to be a character vector. In this instance, our pattern will be “\.”, the period with leading double backslashes to escape the character so it will not be the regular expression “.”, which making a single instance of anything.

arrivals[["city"]] <- gsub("\\.", " ", arrivals[["city"]])

The final product is thus:

city	ALASKA - delayed	ALASKA - on time	AM WEST - delayed	AM WEST - on time
Los Angeles	62	497	117	694
Phoenix	12	221	415	4840
San Diego	20	212	65	383
San Francisco	102	503	129	320
Seattle	305	1841	61	201

Analysis

At first, comparing the delays as they are seems the wisest thing to do, until you realize the sheer difference in total flights between ALASKA and AM WEST.

barplot(as.matrix(select(arrivals, contains("delayed"))),
        beside=T, ylim=c(0,500), xlab="Airlines and Cities",
        ylab="Total Delayed Flights",
        main="Total Delayed Flights by Airline and City",
        legend.text=arrivals[[1]], args.legend=list(x="topright"))

A better comparison would be to compare them as percentages of total flights. To do this, we have to make another tibble and modify the values in it.

First we’re going to make a new tibble named “arrivals.perc” and select out from the arrivals tibble the city column and all columns with the word delay. From there, we are going to change the values for the delayed columns in arrivals.perc to be numbers between 0 and 100. To do this, we need to take the number present in these rows already, and then divide those by the number provided by a rowSums() function. In the rowSums function we will be adding the values under all columns bearing the name “ALASKA”, and then “AM WEST”, and applying it to the proper delayed column.

arrivals.perc <- select(arrivals, c(city,contains("delay")))
arrivals.perc[,"ALASKA - delayed"] <- (arrivals.perc[,"ALASKA - delayed"]/rowSums(select(arrivals, contains("ALASKA"))))*100
arrivals.perc[,"AM WEST - delayed"] <- (arrivals.perc[,"AM WEST - delayed"]/rowSums(select(arrivals, contains("AM WEST"))))*100

## # A tibble: 5 x 3
##   city          `ALASKA - delayed` `AM WEST - delayed`
## * <chr>                      <dbl>               <dbl>
## 1 Los Angeles                11.1                14.4 
## 2 Phoenix                     5.15                7.90
## 3 San Diego                   8.62               14.5 
## 4 San Francisco              16.9                28.7 
## 5 Seattle                    14.2                23.3

The resulting bar graph tells a much more interesting story.

barplot(as.matrix(arrivals.perc[,2:3]), beside=T,
        ylim=c(0,100), xlab="Airlines and Cities",
        ylab="Percentage of Flights Delayed",
        main="Delayed Flights as Percent Of Total Flights by Airline and City", 
        legend.text=arrivals.perc[[1]], args.legend=list(x="topright"))

Even when accounting for the larger number of flights, AM WEST still has significantly more delayed flights. The previous chart made it seem as if Phoenix had been the city receiving most delays, when in truth, it was actually San Francisco, followed closely by Seattle. Looking at both charts, we can see a similar trend in the cities that cause the most issues and the least in general for each airline, but with AM WEST, the delays are amplified. If it were simply a matter of the airports in these cities being consistently troublesome, the percentages would be similar.

sum(select(arrivals, contains("WEST")))/sum(select(arrivals, contains("ALASKA")))

## [1] 1.913907

AM WEST has almost twice the number of flights as ALASKA, so we’d expect there to be more delays overall, but the percentages should be near identical for total delays. Because it is not, we can come to the conclusion safely that this is an issue on AM WEST’s behalf. We can also safely come to the conclusion that if you are waiting for a plane from AM WEST or ALASKA in San Francisco, there’s about a 16 to 30% chance your flight will be delayed. If you want to reduce the chance of your flight being delayed and have to pick one of these two airlines, your best bet according to this data will consistently be ALASKA.

Afterthoughts

While the above conclusion is sound, having information regarding the length of delays would make it easier to discern which airline might be the better pick for each city’s airport, the better airline to take overall, and the average time delayed for each city’s airport.

gabartomeoDATA607Week05

Gabrielle Bartomeo

February 28, 2018