Objective is to tidy and transform data in order to perform analysis to compare the arrival delays of two airlines.
File is displayed below.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Reading csv file, using [-3,] parameter to delet blank line between rows with data
air1 <- read.csv(file = 'https://raw.githubusercontent.com/bsvmelo/CUNY/master/airline_data.csv')[-3,]
head(air1)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time\n 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time\n 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Step 1: Changing column names.
Step 2: Filling blank cells with Airline name.
#Step 1
names(air1)[1:2]<-c("Airline","Status")
#Step 2
air1[2,1]<-air1[1,1]
air1[4,1]<-air1[3,1]
head(air1)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time\n 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time\n 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Step 3: Using Gather() function to transform column with city names into a new variable.
#Step 3
air2<-gather(air1[,3:7],key="City", value="No_Flights")
head(air2)
## City No_Flights
## 1 Los.Angeles 497
## 2 Los.Angeles 62
## 3 Los.Angeles 694
## 4 Los.Angeles 117
## 5 Phoenix 221
## 6 Phoenix 12
Step 4: Transpose columns Airline and Status to newly created data frame - air2.
#Step 4
air2[,3:4]<-air1[,1:2]
air2[,4]<-str_trim(air2[,4])
head(air2)
## City No_Flights Airline Status
## 1 Los.Angeles 497 ALASKA on time
## 2 Los.Angeles 62 ALASKA delayed
## 3 Los.Angeles 694 AM WEST on time
## 4 Los.Angeles 117 AM WEST delayed
## 5 Phoenix 221 ALASKA on time
## 6 Phoenix 12 ALASKA delayed
For a quick visual inspection, plot below shows Number of flights by Airline by City.
ggplot(air2, aes(fill=City, x=Airline, y=No_Flights)) + geom_col(position="stack") + ylab("Number of Flights")
Some highlights:
1: AM West has more flights than Alaska.
2: AM West has more flights to Phoenix than to any other city.
3: Alaska has more flights to Seattle than to any other city.
Another visualization, showing Number of flights by Airline by Status
ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col(position="stack") + ylab("Number of Flights")
Conclusion: Visually, it seems that AM West might have more delayed arrivals than Alaska, in aggregate. However, we need to make each set of stacked bars the same height, for a proper comparison.
ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col(position="fill") + ylab("Number of Flights")
My first impression is not correct! AM West is more punctual than Alaska.
I now show a multiple bar charts tiling showing Status breakdown by City by Airlines for a quick visual comparative inspection.
ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col() + ggtitle("Status by City") + facet_wrap(~City)
Some highlights:
1: AM West flies more frequently to Phoenix than any other location 2: Alaska flies more frequently to Seattle than any other location
As before, we need to make each set of stacked bars the same height, for a proper comparison.
ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col(position="fill") + ggtitle("Status by City") + facet_wrap(~City)
Some highlights:
1: City with most delayed arrival is San Francisco, followed by Seattle, on both airlines as it seems visually.
2: As seem above, AM West biggest destination is Phoenix but this fact doesn’t translate into more delays.
3: On Alaska, biggest destination is Seattle and this seems to impact delayed frequency.
Calculation showing the percentage of delayed flights will done in the next step.
In this step, I will aggregate data in order to perform a calculation that will show which airline is more punctual.
# Total fligths by status
summary_total <- air2 %>%
group_by(Airline, Status) %>%
summarize(Total_Flights=sum(No_Flights))%>%
spread(key="Status", value="Total_Flights") %>%
mutate(total=`on time`+delayed,`On_Time`=percent(round((`on time`/total),2)),`Delayed`=percent(round((delayed/total),2))) %>%
arrange(desc(`On_Time`))
head(summary_total)
## # A tibble: 2 x 6
## # Groups: Airline [3]
## Airline delayed `on time` total On_Time Delayed
## <fct> <int> <int> <int> <chr> <chr>
## 1 AM WEST 787 6438 7225 89% 11%
## 2 ALASKA 501 3274 3775 87% 13%
1: As confirmation to the visual inspection above, AM West is indeed more punctual than Alaska by 2 percentage points!
Next analysis, will look into % Delayed by city by airline, in aggregate.
# best by City
# air2<-air2[,c(2,1,3,4)]
summary_city <- air2 %>%
group_by(City,Airline,Status) %>%
summarize(Total_Flights=sum(No_Flights))%>%
spread(key="Status", value="Total_Flights") %>%
mutate(total=`on time`+delayed,`% On_Time`=percent(round((`on time`/total),2)),`% Delayed`=percent(round((delayed/total),2))) %>%
arrange(`% On_Time`, City)
# Top 10 delayed cities
head(summary_city, n=10)
## # A tibble: 10 x 7
## # Groups: City, Airline [15]
## City Airline delayed `on time` total `% On_Time` `% Delayed`
## <chr> <fct> <int> <int> <int> <chr> <chr>
## 1 San.Francisco AM WEST 129 320 449 71% 29%
## 2 Seattle AM WEST 61 201 262 77% 23%
## 3 San.Francisco ALASKA 102 503 605 83% 17%
## 4 San.Diego AM WEST 65 383 448 85% 15%
## 5 Los.Angeles AM WEST 117 694 811 86% 14%
## 6 Seattle ALASKA 305 1841 2146 86% 14%
## 7 Los.Angeles ALASKA 62 497 559 89% 11%
## 8 San.Diego ALASKA 20 212 232 91% 9%
## 9 Phoenix AM WEST 415 4840 5255 92% 8%
## 10 Phoenix ALASKA 12 221 233 95% 5%
2: Top 3 cities with more delayed flights are San Francisco and Seattle on both AM West and Alaska, as suspected from the charts above.
On Alaska, second city with more delayed flights is also Seattle.
Looking into a similar table but grouping by Airline
summary_airline_1 <- air2 %>%
group_by(Airline,City, Status) %>%
summarize(Total_Flights=sum(No_Flights))%>%
spread(key="Status", value="Total_Flights") %>%
mutate(total=`on time`+delayed,`% On_Time`=percent(round((`on time`/total),2)),`% Delayed`=percent(round((delayed/total),2))) %>%
arrange(Airline,`% On_Time`, City)
# Top 10 delayed cities
head(summary_airline_1, n=10)
## # A tibble: 10 x 7
## # Groups: Airline, City [11]
## Airline City delayed `on time` total `% On_Time` `% Delayed`
## <fct> <chr> <int> <int> <int> <chr> <chr>
## 1 ALASKA San.Francisco 102 503 605 83% 17%
## 2 ALASKA Seattle 305 1841 2146 86% 14%
## 3 ALASKA Los.Angeles 62 497 559 89% 11%
## 4 ALASKA San.Diego 20 212 232 91% 9%
## 5 ALASKA Phoenix 12 221 233 95% 5%
## 6 AM WEST San.Francisco 129 320 449 71% 29%
## 7 AM WEST Seattle 61 201 262 77% 23%
## 8 AM WEST San.Diego 65 383 448 85% 15%
## 9 AM WEST Los.Angeles 117 694 811 86% 14%
## 10 AM WEST Phoenix 415 4840 5255 92% 8%
3: As seem before, Alaska flies more to Seattle than to any other location and has the second worst record in terms of delayed flights. On AM West, despite having more fligths to Phoenix than any other location, this is the most punctual destination.