This week an exercise in tidying data. We’re provided untidy data in a form of csv format. The requirement is to tidy and transform it into an analyzable table where each row is a single observation and each column a single variable.
Before starting the analysis process, we have to know what are the information we need to extract from the avaliable data. In other meaning what is the addedd business value from doing this particular analysis. For me, my questions are:
What is the status of both delay and on-time arrivals for each air line company per state? gives us an indicator which airline can be reliable or not
What is the delay rate per state?
Can we rank the airlines per state according to total number of flights?
Loading the necessary libraries
library(tidyr)
library(dplyr)
library(stringr)
library(kableExtra)
library(ggplot2)
library(magrittr)
library(dplyr)
library(scales)
library(gganimate)
The first step is to read the data into R, I wrote the data on Excel file and save it as .csv file. After that I used the function read.csv() to read the data into R.
library(kableExtra)
air_linetb <- read.csv("arrival_details.csv", header = TRUE, sep = ",")
air_linetb %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
delayed | 62 | 12 | 20 | 102 | 305 | |
NA | NA | NA | NA | NA | ||
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
delayed | 117 | 415 | 65 | 129 | 62 |
As illustrated in the table, the data is messy contianing NA rows, and double title rows. We have to clean up the data first before tidy/transform the data.
air_line_df <- as.data.frame(air_linetb)
clean_df <- air_line_df[-c(3), ]
clean_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle | |
---|---|---|---|---|---|---|---|
1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
2 | delayed | 62 | 12 | 20 | 102 | 305 | |
4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
5 | delayed | 117 | 415 | 65 | 129 | 62 |
# Rename columns
colnames(clean_df)[c(1,2)] <- c("air_line","status")
clean_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
air_line | status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle | |
---|---|---|---|---|---|---|---|
1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
2 | delayed | 62 | 12 | 20 | 102 | 305 | |
4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
5 | delayed | 117 | 415 | 65 | 129 | 62 |
# fill empty cells with the appropriate airline name
clean_df[c(2,2),1] <- "ALASKA"
clean_df[c(4,4),1] <- "AM WEST"
clean_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
air_line | status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle | |
---|---|---|---|---|---|---|---|
1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
2 | ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
5 | AM WEST | delayed | 117 | 415 | 65 | 129 | 62 |
## 'data.frame': 4 obs. of 7 variables:
## $ air_line : Factor w/ 3 levels "","ALASKA","AM WEST": 2 2 3 3
## $ status : Factor w/ 3 levels "","delayed","on time": 3 2 3 2
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 62
Using tidyr package we can tidy the data using the gather() which will create two columns for city names and flights. Here after gathering, I removed the “.” between cities’ long name
gathered_df <- clean_df %>%
gather(City, Flight, 3:7)
gathered_df$City <- str_replace(gathered_df$City, "\\.", " ")
gathered_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
air_line | status | City | Flight |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | delayed | Los Angeles | 62 |
AM WEST | on time | Los Angeles | 694 |
AM WEST | delayed | Los Angeles | 117 |
ALASKA | on time | Phoenix | 221 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | on time | San Diego | 212 |
ALASKA | delayed | San Diego | 20 |
AM WEST | on time | San Diego | 383 |
AM WEST | delayed | San Diego | 65 |
ALASKA | on time | San Francisco | 503 |
ALASKA | delayed | San Francisco | 102 |
AM WEST | on time | San Francisco | 320 |
AM WEST | delayed | San Francisco | 129 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Seattle | 62 |
Then, spread the status column into two columns of delayed and on-time
spread_df <- spread(gathered_df, key = "status", value = "Flight")
spread_df %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
air_line | City | delayed | on time |
---|---|---|---|
ALASKA | Los Angeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | San Diego | 20 | 212 |
ALASKA | San Francisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AM WEST | Los Angeles | 117 | 694 |
AM WEST | Phoenix | 415 | 4840 |
AM WEST | San Diego | 65 | 383 |
AM WEST | San Francisco | 129 | 320 |
AM WEST | Seattle | 62 | 201 |
We will get an overview of the data to explore it then will get indeep to have more insight analysis
by_city <- gathered_df %>%
group_by(City, status) %>%
summarise(Flight=sum(Flight))
ggplot(by_city, aes(x=City, y=Flight, fill=status)) +
geom_bar(stat="identity", position=position_dodge()) +
xlab("City name") + scale_y_continuous(labels=scales::comma) +
scale_x_discrete()
We can see here that the majority of the flights status arriving Phoenix are on time >5,000 flights and almost 500 delayed flights. That’s makes us more curious to know which airline has the most flights in Phoenix. However, let’s calculate the delay rate per state first.
To calculate the delay rate, I summarize the sum of the delayed time devided on sum of the number of on-time flights inside the aggrigation function group_by
delay_rate <- spread_df %>%
group_by(air_line, City) %>%
summarise(delay.rate = sum(delayed)/sum(`on time`))
ggplot(delay_rate, aes(x = City, y = delay.rate,fill = City)) +
geom_bar(stat="identity") +
geom_text(aes( label = scales::percent(delay.rate),
y= delay.rate ), stat= "identity", vjust = -.3) +
labs(y = "Percent", fill="City") +
facet_grid( ~ air_line) +
scale_y_continuous(labels = scales::percent) +
theme(axis.text.x = element_text(angle=65, vjust=0.65))
It seems that AM WEST has more delays rate especially for San Franscisco and Seattle with 40.3% and 30.8% respectively compared to ALASKA which has almost half these rates.
Then we have to get the total flights per state
total_flights <- mutate(spread_df, total_num = delayed + `on time`)
total_flights %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
air_line | City | delayed | on time | total_num |
---|---|---|---|---|
ALASKA | Los Angeles | 62 | 497 | 559 |
ALASKA | Phoenix | 12 | 221 | 233 |
ALASKA | San Diego | 20 | 212 | 232 |
ALASKA | San Francisco | 102 | 503 | 605 |
ALASKA | Seattle | 305 | 1841 | 2146 |
AM WEST | Los Angeles | 117 | 694 | 811 |
AM WEST | Phoenix | 415 | 4840 | 5255 |
AM WEST | San Diego | 65 | 383 | 448 |
AM WEST | San Francisco | 129 | 320 | 449 |
AM WEST | Seattle | 62 | 201 | 263 |
After that, get the highest number per airline company
# get mean first then get the total number of flights that is bigger than this number
highest_num <- total_flights %>%
select(air_line, City, total_num) %>%
filter(total_num > mean(total_flights$total_num)) %>%
arrange(desc(total_num))
highest_num %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
air_line | City | total_num |
---|---|---|
AM WEST | Phoenix | 5255 |
ALASKA | Seattle | 2146 |
It seams that AM WEST is domenating in Phoenix while ALASKA is in Seattle. This comparasion should be clear if we create a ranking system to rank air-lines in each state
ranking <- total_flights %>%
mutate(air_rank = dense_rank(total_num)) %>%
select(air_line, City, total_num, air_rank) %>%
arrange((air_rank))
# Draw plot
ggplot(ranking, aes(x=City, y=air_rank, fill = air_line)) +
geom_bar(stat="identity", width=3, position = position_dodge()) +
labs(fill="air lines") +
labs(title="Ordered Bar Chart",
subtitle="Ranking",
caption="source: ranking") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
facet_grid( ~ City)
From all the previous analysis, we may conclude that AM WEST is the best company in Phoenix; however, ALASKA is the best on Seattle based on the total number of flights.
To have a final conclusion for this analysis, we need more information like tickets prices for instance. However, the preliminary analysis showed that AM WEST is more reputable than ALASKA it has larger number of flights and higher ranking.