Week_5_assignment

Introduction

This week an exercise in tidying data. We’re provided untidy data in a form of csv format. The requirement is to tidy and transform it into an analyzable table where each row is a single observation and each column a single variable.

Methodology

We need to read the data first into R and browse the table
Use the tidyr library to tidy the data and dplyr to transform it
Perform analysis to compare arrival and delayed time for each air line

Requirements

Create a csv file.
Read information from the csv file
Compare the arrival delays for the two airlines
Publish the R-Markdown file to the Rpubs

Questions

Before starting the analysis process, we have to know what are the information we need to extract from the avaliable data. In other meaning what is the addedd business value from doing this particular analysis. For me, my questions are:

What is the status of both delay and on-time arrivals for each air line company per state? gives us an indicator which airline can be reliable or not
What is the delay rate per state?
Can we rank the airlines per state according to total number of flights?

Steps

Loading the necessary libraries

library(tidyr)
library(dplyr)
library(stringr)
library(kableExtra)
library(ggplot2)
library(magrittr)
library(dplyr)
library(scales)
library(gganimate)

Read the csv into R

The first step is to read the data into R, I wrote the data on Excel file and save it as .csv file. After that I used the function read.csv() to read the data into R.

library(kableExtra)
air_linetb <- read.csv("arrival_details.csv", header = TRUE, sep = ",")
air_linetb %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	62

As illustrated in the table, the data is messy contianing NA rows, and double title rows. We have to clean up the data first before tidy/transform the data.

Remove the NA row and empty rows

air_line_df <- as.data.frame(air_linetb)
clean_df <- air_line_df[-c(3), ]
clean_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

	X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
1	ALASKA	on time	497	221	212	503	1841
2		delayed	62	12	20	102	305
4	AM WEST	on time	694	4840	383	320	201
5		delayed	117	415	65	129	62

Rename columns for (first, second) and add airline name to the empty cells in the first column

# Rename columns
colnames(clean_df)[c(1,2)] <- c("air_line","status")
clean_df   %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

	air_line	status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
1	ALASKA	on time	497	221	212	503	1841
2		delayed	62	12	20	102	305
4	AM WEST	on time	694	4840	383	320	201
5		delayed	117	415	65	129	62

# fill empty cells with the appropriate airline name
clean_df[c(2,2),1] <- "ALASKA"
clean_df[c(4,4),1] <- "AM WEST"
clean_df  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

	air_line	status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
1	ALASKA	on time	497	221	212	503	1841
2	ALASKA	delayed	62	12	20	102	305
4	AM WEST	on time	694	4840	383	320	201
5	AM WEST	delayed	117	415	65	129	62

clean_df

The table at this shape cannot be analyzed, we have to tidy the data first. To do so, we can use the tidyr package in R. We need also to transpose the table.

We can start by generating two columns for the City and Flight

str(clean_df)

## 'data.frame':    4 obs. of  7 variables:
##  $ air_line     : Factor w/ 3 levels "","ALASKA","AM WEST": 2 2 3 3
##  $ status       : Factor w/ 3 levels "","delayed","on time": 3 2 3 2
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 62

Using tidyr package we can tidy the data using the gather() which will create two columns for city names and flights. Here after gathering, I removed the “.” between cities’ long name

gathered_df <- clean_df %>%
  gather(City, Flight, 3:7)
gathered_df$City <- str_replace(gathered_df$City, "\\.", " ")

gathered_df  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

air_line	status	City	Flight
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20
AM WEST	on time	San Diego	383
AM WEST	delayed	San Diego	65
ALASKA	on time	San Francisco	503
ALASKA	delayed	San Francisco	102
AM WEST	on time	San Francisco	320
AM WEST	delayed	San Francisco	129
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	62

Then, spread the status column into two columns of delayed and on-time

spread_df <- spread(gathered_df, key = "status", value = "Flight")
spread_df  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

air_line	City	delayed	on time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	62	201

Analyze the data

We will get an overview of the data to explore it then will get indeep to have more insight analysis

by_city <- gathered_df %>%
             group_by(City, status) %>%
             summarise(Flight=sum(Flight))

ggplot(by_city, aes(x=City, y=Flight, fill=status)) +
        geom_bar(stat="identity", position=position_dodge()) +
        xlab("City name") + scale_y_continuous(labels=scales::comma) +
        scale_x_discrete()

We can see here that the majority of the flights status arriving Phoenix are on time >5,000 flights and almost 500 delayed flights. That’s makes us more curious to know which airline has the most flights in Phoenix. However, let’s calculate the delay rate per state first.

To calculate the delay rate, I summarize the sum of the delayed time devided on sum of the number of on-time flights inside the aggrigation function group_by

delay_rate <- spread_df %>% 
                 group_by(air_line, City) %>% 
                 summarise(delay.rate = sum(delayed)/sum(`on time`))

ggplot(delay_rate, aes(x = City, y = delay.rate,fill = City)) + 
     geom_bar(stat="identity") +
     geom_text(aes( label = scales::percent(delay.rate),
                   y= delay.rate ), stat= "identity", vjust = -.3) +
     labs(y = "Percent", fill="City") +
     facet_grid( ~ air_line) +
     scale_y_continuous(labels = scales::percent) +
     theme(axis.text.x = element_text(angle=65, vjust=0.65))

It seems that AM WEST has more delays rate especially for San Franscisco and Seattle with 40.3% and 30.8% respectively compared to ALASKA which has almost half these rates.

Then we have to get the total flights per state

total_flights <- mutate(spread_df, total_num = delayed + `on time`)
total_flights %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

air_line	City	delayed	on time	total_num
ALASKA	Los Angeles	62	497	559
ALASKA	Phoenix	12	221	233
ALASKA	San Diego	20	212	232
ALASKA	San Francisco	102	503	605
ALASKA	Seattle	305	1841	2146
AM WEST	Los Angeles	117	694	811
AM WEST	Phoenix	415	4840	5255
AM WEST	San Diego	65	383	448
AM WEST	San Francisco	129	320	449
AM WEST	Seattle	62	201	263

After that, get the highest number per airline company

# get mean first then get the total number of flights that is bigger than this number 
highest_num <- total_flights %>% 
  select(air_line, City, total_num) %>%
  filter(total_num > mean(total_flights$total_num)) %>%
  arrange(desc(total_num))
highest_num %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

air_line	City	total_num
AM WEST	Phoenix	5255
ALASKA	Seattle	2146

It seams that AM WEST is domenating in Phoenix while ALASKA is in Seattle. This comparasion should be clear if we create a ranking system to rank air-lines in each state

ranking <- total_flights %>% 
  mutate(air_rank = dense_rank(total_num)) %>%
  select(air_line, City, total_num, air_rank) %>% 
  arrange((air_rank))
# Draw plot

ggplot(ranking, aes(x=City, y=air_rank, fill = air_line)) + 
  geom_bar(stat="identity", width=3, position = position_dodge()) + 
  labs(fill="air lines") +
  labs(title="Ordered Bar Chart", 
       subtitle="Ranking", 
       caption="source: ranking") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  facet_grid( ~ City)

From all the previous analysis, we may conclude that AM WEST is the best company in Phoenix; however, ALASKA is the best on Seattle based on the total number of flights.

Conclusion

To have a final conclusion for this analysis, we need more information like tickets prices for instance. However, the preliminary analysis showed that AM WEST is more reputable than ALASKA it has larger number of flights and higher ranking.

Important links

GitHub : GitHub link

csv file: CSV