Tidying and Transforming Data

This assignment is about creating a csv file and then transforming the data to provide analysis on two airlines flight perfmance on key Western airports.

Create the .csv file

My first step was to create the data file in Excel and save it as a csv file.

https://github.com/mjgons/DATA607/blob/master/Book1.csv

Import the .csv file

I then imported the csv file into r.

df <- read.csv("Book1.csv", header = TRUE)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'Book1.csv'
head(df)
##   Airline Flight.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska       On Time         497     221       212           503    1841
## 2  Alaska       Delayed          62      12        20           102     305
## 3 AM West       On Time         694    4840       383           320     201
## 4 AM West       Delayed         117     415        65           129      61
df2 <- as_tibble(df)
head(df2)
## # A tibble: 4 x 7
##   Airline Flight.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
##   <fct>   <fct>               <int>   <int>     <int>         <int>   <int>
## 1 Alaska  On Time               497     221       212           503    1841
## 2 Alaska  Delayed                62      12        20           102     305
## 3 AM West On Time               694    4840       383           320     201
## 4 AM West Delayed               117     415        65           129      61

Transforming the data by going from wide to long

Once the data was imported, it needed to be transformed in order to provide analysis. I’m flipping the data from wide to long in order to do calculations easier.

df3 <- gather(df2, "Arrival City", "Flight Count", 3:7)
  
head(df3, 10)
## # A tibble: 10 x 4
##    Airline Flight.Status `Arrival City` `Flight Count`
##    <fct>   <fct>         <chr>                   <int>
##  1 Alaska  On Time       Los.Angeles               497
##  2 Alaska  Delayed       Los.Angeles                62
##  3 AM West On Time       Los.Angeles               694
##  4 AM West Delayed       Los.Angeles               117
##  5 Alaska  On Time       Phoenix                   221
##  6 Alaska  Delayed       Phoenix                    12
##  7 AM West On Time       Phoenix                  4840
##  8 AM West Delayed       Phoenix                   415
##  9 Alaska  On Time       San.Diego                 212
## 10 Alaska  Delayed       San.Diego                  20

More Transforming the data by breaking out the Delayed and On Time columns so I can do additional cauculations.

df4 <- spread(df3, "Flight.Status", "Flight Count")

head(df4,10)
## # A tibble: 10 x 4
##    Airline `Arrival City` Delayed `On Time`
##    <fct>   <chr>            <int>     <int>
##  1 Alaska  Los.Angeles         62       497
##  2 Alaska  Phoenix             12       221
##  3 Alaska  San.Diego           20       212
##  4 Alaska  San.Francisco      102       503
##  5 Alaska  Seattle            305      1841
##  6 AM West Los.Angeles        117       694
##  7 AM West Phoenix            415      4840
##  8 AM West San.Diego           65       383
##  9 AM West San.Francisco      129       320
## 10 AM West Seattle             61       201

I’m now going to add the total number of flighs by Arrival City and Airline and the on time rate.

df5 <- df4 %>%
  mutate("Total Flights" = (`Delayed` + `On Time`),
         "On Time Rate" = (`On Time` / (`On Time` + `Delayed`))
    
  )

head(df5,15)
## # A tibble: 10 x 6
##    Airline `Arrival City` Delayed `On Time` `Total Flights` `On Time Rate`
##    <fct>   <chr>            <int>     <int>           <int>          <dbl>
##  1 Alaska  Los.Angeles         62       497             559          0.889
##  2 Alaska  Phoenix             12       221             233          0.948
##  3 Alaska  San.Diego           20       212             232          0.914
##  4 Alaska  San.Francisco      102       503             605          0.831
##  5 Alaska  Seattle            305      1841            2146          0.858
##  6 AM West Los.Angeles        117       694             811          0.856
##  7 AM West Phoenix            415      4840            5255          0.921
##  8 AM West San.Diego           65       383             448          0.855
##  9 AM West San.Francisco      129       320             449          0.713
## 10 AM West Seattle             61       201             262          0.767

By Airline Overall

Next it was time to look at the arrival delays of the two airlines by city and overall.

Alaska Airlines Total Flights and On Time Rate We can see that Alaska Airlines had 3,775 total flights and an on time rate of 88.8%

#Alaska Airlines Total Flights and On Time Rate
df5 %>%
  filter(`Airline` == "Alaska") %>%
  summarise(total = sum(`Total Flights`), OnTimeRate = mean(`On Time Rate`))
## # A tibble: 1 x 2
##   total OnTimeRate
##   <int>      <dbl>
## 1  3775      0.888

AM West had nearly double the flights as 7,225 flights and an on time rate that was below Alaska at 82.2%.

#AM West Total Flights and On Time Rate
df5 %>%
  filter(`Airline` == "AM West") %>%
  summarise(total = sum(`Total Flights`), OnTimeRate = mean(`On Time Rate`))
## # A tibble: 1 x 2
##   total OnTimeRate
##   <int>      <dbl>
## 1  7225      0.822

By City Overall

Sorting (arrange) by city allows us to easily compare the on time rates by arrival city. We see that Alaska had a better average on time rates in all five cities in this data set.

df5 %>%
  arrange(`Arrival City`)
## # A tibble: 10 x 6
##    Airline `Arrival City` Delayed `On Time` `Total Flights` `On Time Rate`
##    <fct>   <chr>            <int>     <int>           <int>          <dbl>
##  1 Alaska  Los.Angeles         62       497             559          0.889
##  2 AM West Los.Angeles        117       694             811          0.856
##  3 Alaska  Phoenix             12       221             233          0.948
##  4 AM West Phoenix            415      4840            5255          0.921
##  5 Alaska  San.Diego           20       212             232          0.914
##  6 AM West San.Diego           65       383             448          0.855
##  7 Alaska  San.Francisco      102       503             605          0.831
##  8 AM West San.Francisco      129       320             449          0.713
##  9 Alaska  Seattle            305      1841            2146          0.858
## 10 AM West Seattle             61       201             262          0.767

On Time Rates by Airline by City

ggplot(data = df5) + 
  geom_point(mapping = aes(x = `Airline`, y = `On Time Rate`)) + 
  facet_wrap(~ `Arrival City`, nrow = 2) +
  labs(title = "On Time Rates by Airline by City") +
  theme_bw()

Conclusion

The data clearly shows that Alaska Airlines had better on time rates at all five cities. It’s 88.8% on time rate was ahead of AM West’s on time rate of 82.2%. San Francisco had the worst on time rate overall of the five airports.