The necessary libraries for data clean up are the dplyr and tidyr packages, and the package used for visiualization is ggplot2 and cowplot (arrange graphs). In addition, I added the DT library to display datatables.
library(dplyr)
library(tidyr)
library(ggplot2)
library(cowplot)
library(DT)The table was read in using the “tabl_df” and “read.csv” functions. The table is shown below:
raw.df <- tbl_df(read.csv("week5.csv",header=TRUE,sep=","))
raw.df# A tibble: 5 x 7
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
<fct> <fct> <int> <int> <int> <int> <int>
1 ALASKA on time 497 221 212 503 1841
2 "" delayed 62 12 20 102 305
3 "" "" NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 201
5 "" delayed 117 415 65 129 61
The first thing that pops out is that row 3 is blank, so we can go ahead and remove that row.
raw.df <- raw.df[-c(3),]In addition, the column headers are not accurate. Particularly the airline, status, and the period in between places with two words. Finally, I added the name of the airline to the rows which were delayed.
colnames(raw.df) <- c('Airline','Status','Los Angeles','Phoenix','San Diego','San Francisco','Seattle')
raw.df[2,1] <- raw.df[1,1]
raw.df[4,1] <- raw.df[3,1]
raw.df# A tibble: 4 x 7
Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
<fct> <fct> <int> <int> <int> <int> <int>
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61
Now that we have completed the initial clean-up, we can decide on how we want to display this data to get the most out of it.
First, I used the gather function to list all of the locations in a column as shown below:
flight.status.gather <- raw.df %>% gather(Destination,Count,3:7)
flight.status.gather# A tibble: 20 x 4
Airline Status Destination Count
<fct> <fct> <chr> <int>
1 ALASKA on time Los Angeles 497
2 ALASKA delayed Los Angeles 62
3 AM WEST on time Los Angeles 694
4 AM WEST delayed Los Angeles 117
5 ALASKA on time Phoenix 221
6 ALASKA delayed Phoenix 12
7 AM WEST on time Phoenix 4840
8 AM WEST delayed Phoenix 415
9 ALASKA on time San Diego 212
10 ALASKA delayed San Diego 20
11 AM WEST on time San Diego 383
12 AM WEST delayed San Diego 65
13 ALASKA on time San Francisco 503
14 ALASKA delayed San Francisco 102
15 AM WEST on time San Francisco 320
16 AM WEST delayed San Francisco 129
17 ALASKA on time Seattle 1841
18 ALASKA delayed Seattle 305
19 AM WEST on time Seattle 201
20 AM WEST delayed Seattle 61
Next, I used the spread function to have “delayed” and “on time” be displayed as columns:
flight.status.spread <- flight.status.gather %>% spread(Status, Count)
flight.status.spread# A tibble: 10 x 4
Airline Destination delayed `on time`
<fct> <chr> <int> <int>
1 ALASKA Los Angeles 62 497
2 ALASKA Phoenix 12 221
3 ALASKA San Diego 20 212
4 ALASKA San Francisco 102 503
5 ALASKA Seattle 305 1841
6 AM WEST Los Angeles 117 694
7 AM WEST Phoenix 415 4840
8 AM WEST San Diego 65 383
9 AM WEST San Francisco 129 320
10 AM WEST Seattle 61 201
Next, I would like to create a new column to show the total flights for each airline and destination combination, followed by the percentage of flights that are delayed. This information will allow us to see which destinations are most popular, and/or if certain airlines are preferred for certain destinations, as well as they delay percentage. In order to accomplis this, I used the mutate function:
flight.status.mutate1 <- flight.status.spread %>% mutate(Total_Flights = flight.status.spread$delayed + flight.status.spread$'on time')
flight.status.final <- flight.status.mutate1 %>% mutate(Percent_Delayed = flight.status.mutate1$delayed/flight.status.mutate1$Total_Flights*100)
flight.status.final# A tibble: 10 x 6
Airline Destination delayed `on time` Total_Flights Percent_Delayed
<fct> <chr> <int> <int> <int> <dbl>
1 ALASKA Los Angeles 62 497 559 11.1
2 ALASKA Phoenix 12 221 233 5.15
3 ALASKA San Diego 20 212 232 8.62
4 ALASKA San Francisco 102 503 605 16.9
5 ALASKA Seattle 305 1841 2146 14.2
6 AM WEST Los Angeles 117 694 811 14.4
7 AM WEST Phoenix 415 4840 5255 7.90
8 AM WEST San Diego 65 383 448 14.5
9 AM WEST San Francisco 129 320 449 28.7
10 AM WEST Seattle 61 201 262 23.3
Finally, I created two additional tables comparing the airlines as a whole and comparing the destination delays. I used the group_by function to group by city and airliine, as well as using the summarise function to get the mean.
flight.status.location <- flight.status.final %>% group_by(Destination) %>% summarise(Average_Percent_Delayed = mean(Percent_Delayed))
flight.status.location# A tibble: 5 x 2
Destination Average_Percent_Delayed
<chr> <dbl>
1 Los Angeles 12.8
2 Phoenix 6.52
3 San Diego 11.6
4 San Francisco 22.8
5 Seattle 18.7
flight.status.airline <- flight.status.final %>% group_by(Airline) %>% summarise(Average_Percent_Delayed = mean(Percent_Delayed))
flight.status.airline# A tibble: 2 x 2
Airline Average_Percent_Delayed
<fct> <dbl>
1 ALASKA 11.2
2 AM WEST 17.8
With these final three tables, we can move on to exploring the data.
First, I wanted to plot the number of flights to each destination as a function of airline. The first point that stands out to me are the total number of flights to Phoenix, and how AM WEST virtually dominates this destination. The opposite is true for Seattle where ALASKA dominates this hub.
The more interesting figure is the percentage of flights that were delayed to each destination as a function of airline. It is evident that ALASKA has lower flight delays at every single destination. There are other factors such as cost and customer service, but this piece of information could sway a consumer into picking ALASKA over AM WEST.
gginit.count <- ggplot(flight.status.final,aes(factor(Destination),Total_Flights,fill=Airline))
gginit.delay <- ggplot(flight.status.final,aes(factor(Destination),Percent_Delayed,fill=Airline))
plottype.count <- geom_bar(stat="identity")
plottype.delay <- geom_bar(stat="identity",position="dodge")
plottheme <- theme_bw() + theme(axis.ticks.y=element_blank(),axis.text.y = element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank())
count.plot <- gginit.count + plottype.count + plottheme + xlab('Destination') + ylab('Number of Flights') + ggtitle('Total Number of Flights')
delay.plot <- gginit.delay + plottype.delay + plottheme + xlab('Destination') + ylab('Percent Delayed') + ggtitle('Percent Delayed')
plot_grid(count.plot,delay.plot,align="v",nrow=2, rel_heights=c(1/2,1/2))Next, I wanted to display the last two data frames created in the last section in two tables. The first table analyzes the average percent delayed for flights to each destination. It shows that the flights to Phoenix are the least delayed, while flights to San Francisco are the most delayed. The second table reiterates what we saw in the ‘Percent Delayed’ bar graph, where flgihts from ALASKA airlines are less delayed than AM WEST.
datatable(flight.status.location,rownames=FALSE)datatable(flight.status.airline,rownames=FALSE)