First we load the CSV file by using the read_csv
function from the readr
package. The CSV file was created by copying exactly what was presented in the assignment PDF (shown below). This includes missing fields, blank lines, and all.
The original table
rawdata <- read_csv("week5.csv",col_names=TRUE)
rawdata
## # A tibble: 5 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 Alaska on-ti… 491 221 212 503 1841
## 2 <NA> delay… 62 12 20 102 503
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on-ti… 694 4840 383 320 201
## 5 <NA> delay… 117 415 65 129 61
Looking at what was read into R, we see a few different issues with our raw data:
We’ll take each of these in turn in the next section and show how we can correct them.
First we tackel the two columns that are missing names. This is relatively easy to correct thanks to the dplyr
package and the rename
function:
rawdata <- rename(rawdata, airline = X1, status = X2)
rawdata
## # A tibble: 5 x 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 Alaska on-ti… 491 221 212 503 1841
## 2 <NA> delay… 62 12 20 102 503
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on-ti… 694 4840 383 320 201
## 5 <NA> delay… 117 415 65 129 61
Now every column has a meaningful name. Note that the city names are non-standard (they have spaces). We could correct them as well, but when we begin to tidy the data having them with their original names will be useful.
We had a blank row in our source file, so it stands to reason that we’d have a blank row in our data after input. To remove this empty observation, we can use the filter
function in dplyr
. By looking for observations without a blank status column, we can keep only complete rows.
rawdata <- rawdata %>% filter(!is.na(status))
rawdata
## # A tibble: 4 x 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 Alaska on-ti… 491 221 212 503 1841
## 2 <NA> delay… 62 12 20 102 503
## 3 AM WEST on-ti… 694 4840 383 320 201
## 4 <NA> delay… 117 415 65 129 61
Finally, because the table the data came from was formatted such that the airline wasn’t labeled on each line, we have NAs in that column as well. These are also rectified quite easily by using the fill
function, which is from the tidyr
package.
rawdata <- fill(rawdata,airline)
rawdata
## # A tibble: 4 x 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 Alaska on-ti… 491 221 212 503 1841
## 2 Alaska delay… 62 12 20 102 503
## 3 AM WEST on-ti… 694 4840 383 320 201
## 4 AM WEST delay… 117 415 65 129 61
Now we have a more complete and clean data set. Our next step will be to tidy it.
By putting our data into a “tidy” format, we make analysis within R easier to do.
The largest issue with our raw data is that each city is an observation, yet it is stored in separate columns, like variables. What we need to do is bring these columns down into rows.
We use the gather
function from tidyr
to do rearrange our data. We pass it the names of the columns we need to gather into observations (here, the city columns), the “key” or name of the new column with the gathered values, and a name for the new value column (called “flights” here):
tidydata <- rawdata %>% gather(`Los Angeles`,Phoenix,`San Diego`,`San Francisco`,Seattle, key="city", value="flights")
tidydata
## # A tibble: 20 x 4
## airline status city flights
## <chr> <chr> <chr> <int>
## 1 Alaska on-time Los Angeles 491
## 2 Alaska delayed Los Angeles 62
## 3 AM WEST on-time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on-time Phoenix 221
## 6 Alaska delayed Phoenix 12
## 7 AM WEST on-time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 Alaska on-time San Diego 212
## 10 Alaska delayed San Diego 20
## 11 AM WEST on-time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 Alaska on-time San Francisco 503
## 14 Alaska delayed San Francisco 102
## 15 AM WEST on-time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 Alaska on-time Seattle 1841
## 18 Alaska delayed Seattle 503
## 19 AM WEST on-time Seattle 201
## 20 AM WEST delayed Seattle 61
Now, each city is an observation as it ought to be.
We also have the reverse problem with variables listed as observations, namely the status column. We need to move these into columns by using the tidyr
package and the spread
function.
# Since the "on-time" value gives us a non-standard column name, we change it with rename.
tidydata <- spread(tidydata,key=status, value=flights) %>% rename(on_time = `on-time`)
tidydata
## # A tibble: 10 x 4
## airline city delayed on_time
## <chr> <chr> <int> <int>
## 1 Alaska Los Angeles 62 491
## 2 Alaska Phoenix 12 221
## 3 Alaska San Diego 20 212
## 4 Alaska San Francisco 102 503
## 5 Alaska Seattle 503 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Now the two variables (delayed flights, and on-time flights) are in their proper place in the columns. OUr data set is now tidy.
Now that our data is in a tidy format, our analysis can proceed.
First, we’ll look at each airline’s service into each airport:
# Total flights by city and airline
totalByCity <- tidydata %>%
group_by(airline, city) %>%
summarize(total_flights = delayed + on_time)
# Plot
ggplot(totalByCity, aes(x=city, y=total_flights, fill=airline)) +
geom_col(position="dodge") + ggtitle("Flights by City and Airline") +
xlab("City") + ylab("# Flights")
As expected, each airline does not have the same number of flights coming into the same city, so we cannot compare the delays with raw counts. Instead we need a ratio for each.
# Delay ratio by city and airline
delayByCity <- tidydata %>%
mutate(totalFlights = delayed + on_time, delayRatio = delayed/totalFlights) %>%
group_by(airline, city) %>%
summarize(delay = mean(delayRatio))
# Plot
ggplot(delayByCity, aes(x=city, y=delay, fill=airline)) +
geom_col(position="dodge") + ggtitle("Delay Ratio by City and Airline") +
xlab("City") + ylab("# Flights") + labs(fill="Airline")
Looking at the graph, we can quickly see that AM WEST has a higher ratio of delayed flights to total flights in every city they fly to, when compared to Alaska Airlines. In some cases, the difference is slight (Seattle) while in others it is significant (San Francisco).