Assignment Description: In this assignment we create a .CSV file with some flights data and then read it from R to tidy and transform the data.
I saved the data in a github repository in order to make it accessible from any computer, thus I will load the data from there into R:
#Load the libraries we will need to tidy and analyze our data
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
flightsData <- read.csv("https://raw.githubusercontent.com/marioipena/Assignment4DATA607/master/flightsData.csv", header = TRUE)
Let’s view our data:
flightsData
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Let’s tidy the data. We will get rid of the row with no information and assign a name to the airline and status columns. We will also add the names of the airlines to the two empty cells in the airline column. Additionally we would like to transform the data from wide to long in order to prepare it for analysis:
flightsData2 <- flightsData[-c(3),]
flightsData2 <- flightsData2 %>% rename(Airline = X, Status = X.1)
flightsData2[2, 1] <- "ALASKA"
flightsData2[4, 1] <- "AM WEST"
flightsData2 <- flightsData2 %>% gather("Destination", "Flights", 3:7) %>% arrange(Airline, desc(Status), Destination)
flightsData2
## Airline Status Destination Flights
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
Below are two graphs representing the number of delayed and ontime flights to each destination by airline:
airline1 <- filter(flightsData2, Airline == "ALASKA")
airline2 <- filter(flightsData2, Airline == "AM WEST")
ggplot(airline1, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
ggplot(airline2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
I have used the spread() function in order to transform the data to show the number of delayed and on time flights in two new columns, “delayed” and “on time”.
dataTran <- spread(flightsData2, Status, Flights)
dataTran
## Airline Destination delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
Total delayed and on time by destination and their respective ratio:
totalCity <- dataTran %>% group_by(Destination) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalCity
## # A tibble: 5 x 6
## Destination Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Los.Angeles 179 1191 1370 0.131 0.869
## 2 Phoenix 427 5061 5488 0.0778 0.922
## 3 San.Diego 85 595 680 0.125 0.875
## 4 San.Francisco 231 823 1054 0.219 0.781
## 5 Seattle 366 2042 2408 0.152 0.848
statusRatio <- select(totalCity, -c("Delayed", "OnTime", "Total"))
statusRatio <- statusRatio %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)
ggplot() + geom_bar (aes(x = Destination, y = Ratio, fill = Status), data = statusRatio, stat = "identity")
We can see above that San Francisco has the largest ratio of delayed flights with about 22%, while Phoenix has the largest ratio of on time flights with about 92%.
The Destinations with the maximum and minimum number of flights respectively:
max <- totalCity %>% filter(Total == max(Total))
max
## # A tibble: 1 x 6
## Destination Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Phoenix 427 5061 5488 0.0778 0.922
min <- totalCity %>% filter(Total == min(Total))
min
## # A tibble: 1 x 6
## Destination Delayed OnTime Total DelayRatio OnTimeRatio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 San.Diego 85 595 680 0.125 0.875
Total delayed and on time by airline and their respective ratio:
totalAirline <- dataTran %>% group_by(Airline) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalAirline
## # A tibble: 2 x 6
## Airline Delayed OnTime Total DelayRatio OnTimeRatio
## <fct> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA 501 3274 3775 0.133 0.867
## 2 AM WEST 787 6438 7225 0.109 0.891
statusRatio2 <- select(totalAirline, -c("Delayed", "OnTime", "Total"))
statusRatio2 <- statusRatio2 %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)
ggplot() + geom_bar (aes(x = Airline, y = Ratio, fill = Status), data = statusRatio2, stat = "identity")
We can see above that the ratios for delayed and on time flights for both airlines are fairly similar when destinations are combined, however AM West airlines has nearly double the number of flights than those of Alaska airlines.
Finally, we can compare the number of delayed and on time flights by destination and Airline. In the plot below we can clearly see that AM WEST airline had the most on time flights by far. However, there are two destinations to which ALASKA airline has more on time fights than AM WEST airline, and those are San Francisco and Seattle.
ggplot(flightsData2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")