Assignment 4 - Tidying and Transforming Data

Loading Libraries

The necessary libraries for data clean up are the dplyr and tidyr packages, and the package used for visiualization is ggplot2 and cowplot (arrange graphs). In addition, I added the DT library to display datatables.

library(dplyr)
library(tidyr)
library(ggplot2)
library(cowplot)
library(DT)

Reading in the Data and Inspection

The table was read in using the “tabl_df” and “read.csv” functions. The table is shown below:

raw.df <- tbl_df(read.csv("week5.csv",header=TRUE,sep=","))
raw.df

# A tibble: 5 x 7
  X       X.1     Los.Angeles Phoenix San.Diego San.Francisco Seattle
  <fct>   <fct>         <int>   <int>     <int>         <int>   <int>
1 ALASKA  on time         497     221       212           503    1841
2 ""      delayed          62      12        20           102     305
3 ""      ""               NA      NA        NA            NA      NA
4 AM WEST on time         694    4840       383           320     201
5 ""      delayed         117     415        65           129      61

The first thing that pops out is that row 3 is blank, so we can go ahead and remove that row.

raw.df <- raw.df[-c(3),]

In addition, the column headers are not accurate. Particularly the airline, status, and the period in between places with two words. Finally, I added the name of the airline to the rows which were delayed.

colnames(raw.df) <- c('Airline','Status','Los Angeles','Phoenix','San Diego','San Francisco','Seattle')
raw.df[2,1] <- raw.df[1,1]
raw.df[4,1] <- raw.df[3,1]
raw.df

# A tibble: 4 x 7
  Airline Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
  <fct>   <fct>           <int>   <int>       <int>           <int>   <int>
1 ALASKA  on time           497     221         212             503    1841
2 ALASKA  delayed            62      12          20             102     305
3 AM WEST on time           694    4840         383             320     201
4 AM WEST delayed           117     415          65             129      61

Now that we have completed the initial clean-up, we can decide on how we want to display this data to get the most out of it.

Arranging the Data

First, I used the gather function to list all of the locations in a column as shown below:

flight.status.gather <- raw.df %>% gather(Destination,Count,3:7)
flight.status.gather

# A tibble: 20 x 4
   Airline Status  Destination   Count
   <fct>   <fct>   <chr>         <int>
 1 ALASKA  on time Los Angeles     497
 2 ALASKA  delayed Los Angeles      62
 3 AM WEST on time Los Angeles     694
 4 AM WEST delayed Los Angeles     117
 5 ALASKA  on time Phoenix         221
 6 ALASKA  delayed Phoenix          12
 7 AM WEST on time Phoenix        4840
 8 AM WEST delayed Phoenix         415
 9 ALASKA  on time San Diego       212
10 ALASKA  delayed San Diego        20
11 AM WEST on time San Diego       383
12 AM WEST delayed San Diego        65
13 ALASKA  on time San Francisco   503
14 ALASKA  delayed San Francisco   102
15 AM WEST on time San Francisco   320
16 AM WEST delayed San Francisco   129
17 ALASKA  on time Seattle        1841
18 ALASKA  delayed Seattle         305
19 AM WEST on time Seattle         201
20 AM WEST delayed Seattle          61

Next, I used the spread function to have “delayed” and “on time” be displayed as columns:

flight.status.spread <- flight.status.gather %>% spread(Status, Count)
flight.status.spread

# A tibble: 10 x 4
   Airline Destination   delayed `on time`
   <fct>   <chr>           <int>     <int>
 1 ALASKA  Los Angeles        62       497
 2 ALASKA  Phoenix            12       221
 3 ALASKA  San Diego          20       212
 4 ALASKA  San Francisco     102       503
 5 ALASKA  Seattle           305      1841
 6 AM WEST Los Angeles       117       694
 7 AM WEST Phoenix           415      4840
 8 AM WEST San Diego          65       383
 9 AM WEST San Francisco     129       320
10 AM WEST Seattle            61       201

Next, I would like to create a new column to show the total flights for each airline and destination combination, followed by the percentage of flights that are delayed. This information will allow us to see which destinations are most popular, and/or if certain airlines are preferred for certain destinations, as well as they delay percentage. In order to accomplis this, I used the mutate function:

flight.status.mutate1 <- flight.status.spread %>% mutate(Total_Flights = flight.status.spread$delayed + flight.status.spread$'on time')
flight.status.final <- flight.status.mutate1 %>% mutate(Percent_Delayed = flight.status.mutate1$delayed/flight.status.mutate1$Total_Flights*100)
flight.status.final

# A tibble: 10 x 6
   Airline Destination   delayed `on time` Total_Flights Percent_Delayed
   <fct>   <chr>           <int>     <int>         <int>           <dbl>
 1 ALASKA  Los Angeles        62       497           559           11.1 
 2 ALASKA  Phoenix            12       221           233            5.15
 3 ALASKA  San Diego          20       212           232            8.62
 4 ALASKA  San Francisco     102       503           605           16.9 
 5 ALASKA  Seattle           305      1841          2146           14.2 
 6 AM WEST Los Angeles       117       694           811           14.4 
 7 AM WEST Phoenix           415      4840          5255            7.90
 8 AM WEST San Diego          65       383           448           14.5 
 9 AM WEST San Francisco     129       320           449           28.7 
10 AM WEST Seattle            61       201           262           23.3

Finally, I created two additional tables comparing the airlines as a whole and comparing the destination delays. I used the group_by function to group by city and airliine, as well as using the summarise function to get the mean.

flight.status.location <- flight.status.final %>% group_by(Destination) %>% summarise(Average_Percent_Delayed = mean(Percent_Delayed))
flight.status.location

# A tibble: 5 x 2
  Destination   Average_Percent_Delayed
  <chr>                           <dbl>
1 Los Angeles                     12.8 
2 Phoenix                          6.52
3 San Diego                       11.6 
4 San Francisco                   22.8 
5 Seattle                         18.7

flight.status.airline <- flight.status.final %>% group_by(Airline) %>% summarise(Average_Percent_Delayed = mean(Percent_Delayed))
flight.status.airline

# A tibble: 2 x 2
  Airline Average_Percent_Delayed
  <fct>                     <dbl>
1 ALASKA                     11.2
2 AM WEST                    17.8

With these final three tables, we can move on to exploring the data.

Data Analysis

First, I wanted to plot the number of flights to each destination as a function of airline. The first point that stands out to me are the total number of flights to Phoenix, and how AM WEST virtually dominates this destination. The opposite is true for Seattle where ALASKA dominates this hub.

The more interesting figure is the percentage of flights that were delayed to each destination as a function of airline. It is evident that ALASKA has lower flight delays at every single destination. There are other factors such as cost and customer service, but this piece of information could sway a consumer into picking ALASKA over AM WEST.

gginit.count <- ggplot(flight.status.final,aes(factor(Destination),Total_Flights,fill=Airline))
gginit.delay <- ggplot(flight.status.final,aes(factor(Destination),Percent_Delayed,fill=Airline))
plottype.count <- geom_bar(stat="identity")
plottype.delay <- geom_bar(stat="identity",position="dodge")
plottheme <- theme_bw() + theme(axis.ticks.y=element_blank(),axis.text.y = element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank())
count.plot <- gginit.count + plottype.count + plottheme + xlab('Destination') + ylab('Number of Flights') + ggtitle('Total Number of Flights')
delay.plot <- gginit.delay + plottype.delay + plottheme + xlab('Destination') + ylab('Percent Delayed') + ggtitle('Percent Delayed')
plot_grid(count.plot,delay.plot,align="v",nrow=2, rel_heights=c(1/2,1/2))

Next, I wanted to display the last two data frames created in the last section in two tables. The first table analyzes the average percent delayed for flights to each destination. It shows that the flights to Phoenix are the least delayed, while flights to San Francisco are the most delayed. The second table reiterates what we saw in the ‘Percent Delayed’ bar graph, where flgihts from ALASKA airlines are less delayed than AM WEST.

datatable(flight.status.location,rownames=FALSE)

datatable(flight.status.airline,rownames=FALSE)

Assignment 4 - Tidying and Transforming Data

Ryan Gordon

02/24/2019

Loading Libraries

Reading in the Data and Inspection

Arranging the Data

Data Analysis