Assignment 4: Tidying and Transforming Data

Mario Pena

9/29/2019

Assignment Description: In this assignment we create a .CSV file with some flights data and then read it from R to tidy and transform the data.

I saved the data in a github repository in order to make it accessible from any computer, thus I will load the data from there into R:

#Load the libraries we will need to tidy and analyze our data
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

flightsData <- read.csv("https://raw.githubusercontent.com/marioipena/Assignment4DATA607/master/flightsData.csv", header = TRUE)

Let’s view our data:

flightsData
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Let’s tidy the data. We will get rid of the row with no information and assign a name to the airline and status columns. We will also add the names of the airlines to the two empty cells in the airline column. Additionally we would like to transform the data from wide to long in order to prepare it for analysis:

flightsData2 <- flightsData[-c(3),]
flightsData2 <- flightsData2 %>% rename(Airline = X, Status = X.1)
flightsData2[2, 1] <- "ALASKA"
flightsData2[4, 1] <- "AM WEST"
flightsData2 <- flightsData2 %>% gather("Destination", "Flights", 3:7) %>% arrange(Airline, desc(Status), Destination)
flightsData2
##    Airline  Status   Destination Flights
## 1   ALASKA on time   Los.Angeles     497
## 2   ALASKA on time       Phoenix     221
## 3   ALASKA on time     San.Diego     212
## 4   ALASKA on time San.Francisco     503
## 5   ALASKA on time       Seattle    1841
## 6   ALASKA delayed   Los.Angeles      62
## 7   ALASKA delayed       Phoenix      12
## 8   ALASKA delayed     San.Diego      20
## 9   ALASKA delayed San.Francisco     102
## 10  ALASKA delayed       Seattle     305
## 11 AM WEST on time   Los.Angeles     694
## 12 AM WEST on time       Phoenix    4840
## 13 AM WEST on time     San.Diego     383
## 14 AM WEST on time San.Francisco     320
## 15 AM WEST on time       Seattle     201
## 16 AM WEST delayed   Los.Angeles     117
## 17 AM WEST delayed       Phoenix     415
## 18 AM WEST delayed     San.Diego      65
## 19 AM WEST delayed San.Francisco     129
## 20 AM WEST delayed       Seattle      61

Below are two graphs representing the number of delayed and ontime flights to each destination by airline:

airline1 <- filter(flightsData2, Airline == "ALASKA")
airline2 <- filter(flightsData2, Airline == "AM WEST")

ggplot(airline1, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")
ggplot(airline2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")

I have used the spread() function in order to transform the data to show the number of delayed and on time flights in two new columns, “delayed” and “on time”.

dataTran <- spread(flightsData2, Status, Flights)
dataTran
##    Airline   Destination delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

Total delayed and on time by destination and their respective ratio:

totalCity <- dataTran %>% group_by(Destination) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalCity
## # A tibble: 5 x 6
##   Destination   Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>           <int>  <int> <int>      <dbl>       <dbl>
## 1 Los.Angeles       179   1191  1370     0.131        0.869
## 2 Phoenix           427   5061  5488     0.0778       0.922
## 3 San.Diego          85    595   680     0.125        0.875
## 4 San.Francisco     231    823  1054     0.219        0.781
## 5 Seattle           366   2042  2408     0.152        0.848
statusRatio <- select(totalCity, -c("Delayed", "OnTime", "Total"))
statusRatio <- statusRatio %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)

ggplot() + geom_bar (aes(x = Destination, y = Ratio, fill = Status), data = statusRatio, stat = "identity")

We can see above that San Francisco has the largest ratio of delayed flights with about 22%, while Phoenix has the largest ratio of on time flights with about 92%.

The Destinations with the maximum and minimum number of flights respectively:

max <- totalCity %>% filter(Total == max(Total))
max
## # A tibble: 1 x 6
##   Destination Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>         <int>  <int> <int>      <dbl>       <dbl>
## 1 Phoenix         427   5061  5488     0.0778       0.922
min <- totalCity %>% filter(Total == min(Total))
min
## # A tibble: 1 x 6
##   Destination Delayed OnTime Total DelayRatio OnTimeRatio
##   <chr>         <int>  <int> <int>      <dbl>       <dbl>
## 1 San.Diego        85    595   680      0.125       0.875

Total delayed and on time by airline and their respective ratio:

totalAirline <- dataTran %>% group_by(Airline) %>% summarise(Delayed = sum(delayed),OnTime = sum(`on time`), Total = Delayed + OnTime) %>% mutate(DelayRatio = Delayed/Total, OnTimeRatio = OnTime/Total)
totalAirline
## # A tibble: 2 x 6
##   Airline Delayed OnTime Total DelayRatio OnTimeRatio
##   <fct>     <int>  <int> <int>      <dbl>       <dbl>
## 1 ALASKA      501   3274  3775      0.133       0.867
## 2 AM WEST     787   6438  7225      0.109       0.891
statusRatio2 <- select(totalAirline, -c("Delayed", "OnTime", "Total"))
statusRatio2 <- statusRatio2 %>% rename(Delay = DelayRatio, OnTime = OnTimeRatio) %>% gather("Status", "Ratio", 2:3)

ggplot() + geom_bar (aes(x = Airline, y = Ratio, fill = Status), data = statusRatio2, stat = "identity")

We can see above that the ratios for delayed and on time flights for both airlines are fairly similar when destinations are combined, however AM West airlines has nearly double the number of flights than those of Alaska airlines.

Finally, we can compare the number of delayed and on time flights by destination and Airline. In the plot below we can clearly see that AM WEST airline had the most on time flights by far. However, there are two destinations to which ALASKA airline has more on time fights than AM WEST airline, and those are San Francisco and Seattle.

ggplot(flightsData2, aes(Destination, Flights, fill = Status, color = Airline)) + geom_bar(position="dodge", stat="identity")