Assignment 5 Tidying and Transforming Data

Objective is to tidy and transform data in order to perform analysis to compare the arrival delays of two airlines.

File is displayed below.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Reading csv file, using [-3,] parameter to delet blank line between rows with data
air1 <- read.csv(file = 'https://raw.githubusercontent.com/bsvmelo/CUNY/master/airline_data.csv')[-3,]

head(air1)
##         X       X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time\n         497     221       212           503    1841
## 2           delayed          62      12        20           102     305
## 4 AM WEST on time\n         694    4840       383           320     201
## 5           delayed         117     415        65           129      61

Tidying and transforming data

Step 1: Changing column names.

Step 2: Filling blank cells with Airline name.

#Step 1
names(air1)[1:2]<-c("Airline","Status")
#Step 2
air1[2,1]<-air1[1,1]
air1[4,1]<-air1[3,1]
head(air1)
##   Airline    Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time\n         497     221       212           503    1841
## 2  ALASKA   delayed          62      12        20           102     305
## 4 AM WEST on time\n         694    4840       383           320     201
## 5 AM WEST   delayed         117     415        65           129      61

Step 3: Using Gather() function to transform column with city names into a new variable.

#Step 3
air2<-gather(air1[,3:7],key="City", value="No_Flights")
head(air2)
##          City No_Flights
## 1 Los.Angeles        497
## 2 Los.Angeles         62
## 3 Los.Angeles        694
## 4 Los.Angeles        117
## 5     Phoenix        221
## 6     Phoenix         12

Step 4: Transpose columns Airline and Status to newly created data frame - air2.

#Step 4
air2[,3:4]<-air1[,1:2]
air2[,4]<-str_trim(air2[,4])

head(air2)
##          City No_Flights Airline  Status
## 1 Los.Angeles        497  ALASKA on time
## 2 Los.Angeles         62  ALASKA delayed
## 3 Los.Angeles        694 AM WEST on time
## 4 Los.Angeles        117 AM WEST delayed
## 5     Phoenix        221  ALASKA on time
## 6     Phoenix         12  ALASKA delayed

Exploratory Data Analysis

For a quick visual inspection, plot below shows Number of flights by Airline by City.

ggplot(air2, aes(fill=City, x=Airline, y=No_Flights)) + geom_col(position="stack") + ylab("Number of Flights")

Some highlights:

1: AM West has more flights than Alaska.

2: AM West has more flights to Phoenix than to any other city.

3: Alaska has more flights to Seattle than to any other city.

Another visualization, showing Number of flights by Airline by Status

ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col(position="stack") + ylab("Number of Flights")

Conclusion: Visually, it seems that AM West might have more delayed arrivals than Alaska, in aggregate. However, we need to make each set of stacked bars the same height, for a proper comparison.

ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col(position="fill") + ylab("Number of Flights")

My first impression is not correct! AM West is more punctual than Alaska.

EDA - by City

I now show a multiple bar charts tiling showing Status breakdown by City by Airlines for a quick visual comparative inspection.

ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col() + ggtitle("Status by City") + facet_wrap(~City)

Some highlights:

1: AM West flies more frequently to Phoenix than any other location 2: Alaska flies more frequently to Seattle than any other location

As before, we need to make each set of stacked bars the same height, for a proper comparison.

ggplot(air2, aes(fill=Status, x=Airline, y=No_Flights)) + geom_col(position="fill") + ggtitle("Status by City") + facet_wrap(~City)

Some highlights:

1: City with most delayed arrival is San Francisco, followed by Seattle, on both airlines as it seems visually.

2: As seem above, AM West biggest destination is Phoenix but this fact doesn’t translate into more delays.

3: On Alaska, biggest destination is Seattle and this seems to impact delayed frequency.

Calculation showing the percentage of delayed flights will done in the next step.

Summarizing data using piping

In this step, I will aggregate data in order to perform a calculation that will show which airline is more punctual.

# Total fligths by status
summary_total <- air2 %>%
  group_by(Airline, Status) %>%
  summarize(Total_Flights=sum(No_Flights))%>%
  spread(key="Status", value="Total_Flights") %>%
  mutate(total=`on time`+delayed,`On_Time`=percent(round((`on time`/total),2)),`Delayed`=percent(round((delayed/total),2))) %>%
  arrange(desc(`On_Time`))
head(summary_total)
## # A tibble: 2 x 6
## # Groups:   Airline [3]
##   Airline delayed `on time` total On_Time Delayed
##   <fct>     <int>     <int> <int> <chr>   <chr>  
## 1 AM WEST     787      6438  7225 89%     11%    
## 2 ALASKA      501      3274  3775 87%     13%

Conclusion 1

1: As confirmation to the visual inspection above, AM West is indeed more punctual than Alaska by 2 percentage points!

Next analysis, will look into % Delayed by city by airline, in aggregate.

# best by City
# air2<-air2[,c(2,1,3,4)]
summary_city <- air2 %>%
  group_by(City,Airline,Status) %>%
  summarize(Total_Flights=sum(No_Flights))%>%
  spread(key="Status", value="Total_Flights") %>%
  mutate(total=`on time`+delayed,`% On_Time`=percent(round((`on time`/total),2)),`% Delayed`=percent(round((delayed/total),2))) %>%
  arrange(`% On_Time`, City)
# Top 10 delayed cities
head(summary_city, n=10)
## # A tibble: 10 x 7
## # Groups:   City, Airline [15]
##    City          Airline delayed `on time` total `% On_Time` `% Delayed`
##    <chr>         <fct>     <int>     <int> <int> <chr>       <chr>      
##  1 San.Francisco AM WEST     129       320   449 71%         29%        
##  2 Seattle       AM WEST      61       201   262 77%         23%        
##  3 San.Francisco ALASKA      102       503   605 83%         17%        
##  4 San.Diego     AM WEST      65       383   448 85%         15%        
##  5 Los.Angeles   AM WEST     117       694   811 86%         14%        
##  6 Seattle       ALASKA      305      1841  2146 86%         14%        
##  7 Los.Angeles   ALASKA       62       497   559 89%         11%        
##  8 San.Diego     ALASKA       20       212   232 91%         9%         
##  9 Phoenix       AM WEST     415      4840  5255 92%         8%         
## 10 Phoenix       ALASKA       12       221   233 95%         5%

Conclusion 2

2: Top 3 cities with more delayed flights are San Francisco and Seattle on both AM West and Alaska, as suspected from the charts above.

On Alaska, second city with more delayed flights is also Seattle.

Looking into a similar table but grouping by Airline

summary_airline_1 <- air2 %>%
  group_by(Airline,City, Status) %>%
  summarize(Total_Flights=sum(No_Flights))%>%
  spread(key="Status", value="Total_Flights") %>%
  mutate(total=`on time`+delayed,`% On_Time`=percent(round((`on time`/total),2)),`% Delayed`=percent(round((delayed/total),2))) %>%
  arrange(Airline,`% On_Time`, City)
# Top 10 delayed cities
head(summary_airline_1, n=10)
## # A tibble: 10 x 7
## # Groups:   Airline, City [11]
##    Airline City          delayed `on time` total `% On_Time` `% Delayed`
##    <fct>   <chr>           <int>     <int> <int> <chr>       <chr>      
##  1 ALASKA  San.Francisco     102       503   605 83%         17%        
##  2 ALASKA  Seattle           305      1841  2146 86%         14%        
##  3 ALASKA  Los.Angeles        62       497   559 89%         11%        
##  4 ALASKA  San.Diego          20       212   232 91%         9%         
##  5 ALASKA  Phoenix            12       221   233 95%         5%         
##  6 AM WEST San.Francisco     129       320   449 71%         29%        
##  7 AM WEST Seattle            61       201   262 77%         23%        
##  8 AM WEST San.Diego          65       383   448 85%         15%        
##  9 AM WEST Los.Angeles       117       694   811 86%         14%        
## 10 AM WEST Phoenix           415      4840  5255 92%         8%

Conclusion 3

3: As seem before, Alaska flies more to Seattle than to any other location and has the second worst record in terms of delayed flights. On AM West, despite having more fligths to Phoenix than any other location, this is the most punctual destination.