DATA 607 Hw 5 - Tidy Data

This homework uses dplyr, gridExtra, ggplot2 and tidyr; the library() calls that load them are not shown in the code below.

flights <- read.csv("C:/Users/Exped/Desktop/Textbooks/607 Homeworks/Completed607/flights.csv") 
flights

##        X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seatlle
## 1 ALASKA on time         497     221       212           503    1841
## 2        delayed          62      12        20           102     305
## 3                         NA      NA        NA            NA      NA
## 4 AMWEST on time         694    4840       383           320     201
## 5        delayed         117     415        65           129      61

flights = flights[!is.na(flights$Los.Angeles),]
names(flights)[1:2] = c("Airline","Arrival")
flights

##   Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seatlle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4  AMWEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

If there were more than 4 rows, I would use seq() to fill in the blank airline names (provided the flights data followed a regular order). With only 4 rows, I fill them in by hand.

flights$Airline[2] = 'ALASKA'
flights$Airline[4] = 'AMWEST'
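
For a larger table, the seq() idea mentioned above could look something like the sketch below. It assumes the rows strictly alternate (a named "on time" row followed by a blank "delayed" row), and it uses a hypothetical toy data frame, including a made-up third airline, rather than the homework's flights data.

```r
# Hypothetical toy table with the same layout as flights: each airline name
# appears only on the first of its two rows. "DELTA" is invented for illustration.
toy <- data.frame(
  Airline = c("ALASKA", "", "AMWEST", "", "DELTA", ""),
  Arrival = rep(c("on time", "delayed"), times = 3),
  stringsAsFactors = FALSE
)

named_rows <- seq(1, nrow(toy), by = 2)  # rows that carry the airline name
blank_rows <- seq(2, nrow(toy), by = 2)  # rows directly beneath them

# Copy each name down into the blank row that follows it.
toy$Airline[blank_rows] <- toy$Airline[named_rows]
toy$Airline
```

This only works if the ordering assumption holds; a table with a missing or extra row per airline would need a different fill strategy.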

In the next two lines of code, we use tidyr to gather the individual city columns (columns 3 through 7) into City/Freq pairs. We then continue tidying the data frame by spreading the Arrival column, so that its two values (delayed and on time) become columns of their own and the redundant Arrival column is eliminated.

flights = gather(flights,"City","Freq",3:7)
flights = spread(flights,Arrival,Freq)
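
gather() and spread() still work, but tidyr's documentation now marks them as superseded by pivot_longer() and pivot_wider(). As a rough sketch of the same reshape with the newer functions, assuming tidyr 1.0.0 or later and using a small hand-typed data frame (the ALASKA rows for two cities only) instead of the full flights table:

```r
library(tidyr)

# Two-city excerpt of the ALASKA rows, typed in by hand for illustration.
wide <- data.frame(
  Airline = c("ALASKA", "ALASKA"),
  Arrival = c("on time", "delayed"),
  Los.Angeles = c(497, 62),
  Phoenix = c(221, 12),
  stringsAsFactors = FALSE
)

# Step 1 (like gather): stack the city columns into City/Freq pairs.
long <- pivot_longer(wide, cols = c(Los.Angeles, Phoenix),
                     names_to = "City", values_to = "Freq")

# Step 2 (like spread): turn the Arrival values into their own columns.
tidy <- pivot_wider(long, names_from = Arrival, values_from = Freq)
tidy
```

The result has one row per airline and city, with `on time` and delayed as separate columns, which is the same shape the gather/spread pair produces.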

flights = mutate(flights, 
          total = delayed + `on time`)
flights = mutate(flights,
          delay = round(100*delayed/(`on time` + delayed),2))

flights

##    Airline          City delayed on time total delay
## 1   ALASKA   Los.Angeles      62     497   559 11.09
## 2   ALASKA       Phoenix      12     221   233  5.15
## 3   ALASKA     San.Diego      20     212   232  8.62
## 4   ALASKA San.Francisco     102     503   605 16.86
## 5   ALASKA       Seatlle     305    1841  2146 14.21
## 6   AMWEST   Los.Angeles     117     694   811 14.43
## 7   AMWEST       Phoenix     415    4840  5255  7.90
## 8   AMWEST     San.Diego      65     383   448 14.51
## 9   AMWEST San.Francisco     129     320   449 28.73
## 10  AMWEST       Seatlle      61     201   262 23.28

avgDelayALASKA = mean(select(flights,delay)[[1]][1:5])
avgDelayAMWEST = mean(select(flights,delay)[[1]][6:10])  
  
avgDelayALASKA # Mean of the per-city delay percentages (ALASKA airlines)

## [1] 11.186

avgDelayAMWEST # Mean of the per-city delay percentages (AMWEST airlines)

## [1] 17.77
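
The two means above are picked out by row position (1:5 and 6:10), which depends on the table staying sorted by airline. One possible alternative, sketched here in base R with the delay percentages copied by hand from the table, is to group by the Airline column instead:

```r
# Delay percentages copied from the printed table (5 cities per airline).
delays <- data.frame(
  Airline = rep(c("ALASKA", "AMWEST"), each = 5),
  delay   = c(11.09, 5.15, 8.62, 16.86, 14.21,
              14.43, 7.90, 14.51, 28.73, 23.28)
)

# Mean of the per-city percentages, computed separately for each airline.
avgByAirline <- tapply(delays$delay, delays$Airline, mean)
avgByAirline
```

This gives the same two values without relying on the row order.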

The numbers above suggest that if ALASKA airlines were to host flights to other cities, we could expect fewer delayed flights than if AMWEST were to host the same flights. If we look at the data in some ggplots, we’ll see that AMWEST has a higher delay rate in every city and conducts a majority of its flights from Phoenix. It is interesting to note that, for AMWEST, the share of delayed flights tends to increase as the number of flights per city decreases.

figure1 = ggplot(flights, aes(x = Airline, y = delay/100, fill = City)) +
          geom_bar(stat = "identity", position = "dodge") +
          xlab("Airlines") + ylab("Proportion of flights delayed")

figure2 = ggplot(flights, aes(x = Airline, y = total, fill = City)) +
          geom_bar(stat = "identity", position = "dodge") +
          xlab("Airlines") + ylab("Number of flights")
grid.arrange(figure1, figure2, ncol = 2)

Next we demonstrate some piping (I'm not too good at it yet, but I'm getting the hang of it) and compute the overall probability of delay per airline. This is a better approach than averaging the individual city probabilities because it accounts for the differing number of flights per city.

summarisedData = flights %>%
  group_by(Airline) %>%
  summarise(totalDelayed = sum(delayed),
            totalFlights = sum(total),   # all flights, delayed plus on time
            avgDelay = totalDelayed/totalFlights*100)
summarisedData

## # A tibble: 2 × 4
##   Airline totalDelayed totalFlights avgDelay
##    <fctr>        <int>        <int>    <dbl>
## 1  ALASKA          501         3775 13.27152
## 2  AMWEST          787         7225 10.89273

From here, we see that the earlier statement is not wrong, but it does not accurately describe the bulk of each airline's flights. Because AMWEST hosts a majority of its flights from Phoenix, where its delays are less frequent, the probability of delay for a randomly selected AMWEST flight is actually lower than for a randomly selected ALASKA flight, even though ALASKA showed the lower expected probability of delay for a newly acquired destination city.
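
To make the difference between the two kinds of average concrete, here is a small base R sketch using the AMWEST counts copied by hand from the table. The variable names are illustrative only and are not part of the homework code.

```r
# AMWEST counts copied from the printed table, in the order
# Los.Angeles, Phoenix, San.Diego, San.Francisco, Seatlle.
delayed <- c(117, 415, 65, 129, 61)
total   <- c(811, 5255, 448, 449, 262)

rate <- 100 * delayed / total            # per-city delay percentage

unweighted <- mean(rate)                       # every city counts equally
weighted   <- weighted.mean(rate, w = total)   # cities weighted by number of flights
pooled     <- 100 * sum(delayed) / sum(total)  # all flights pooled together

c(unweighted = unweighted, weighted = weighted, pooled = pooled)
```

The unweighted mean comes out near 17.77, while the weighted mean and the pooled rate agree at about 10.89. Phoenix carries most of the weight, so its low delay rate pulls the overall AMWEST figure down.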

Michael Muller

March 5, 2017