Introduction

For this week’s assignment, we dived into tidy data. Oftentimes we will receive data from a wide range of sources and most of the time it requires cleaning. A part of cleaning includes tidying the data into a more usable structure where each column and row in the dataset represents a variable and observation, respectively. I will use tidyr and dplyr to tidy and transform the data about airlines and then perform analysis to draw conclusions from the data.

#Read in csv. file

airlines<- read.csv("C:\\Users\\nakes\\OneDrive\\Documents\\DATA607\\airlines.csv")

tibble(airlines)
## # A tibble: 5 × 7
##   X         X.1       Los.Angeles Phoenix San.Diego San.Francisco Seattle
##   <chr>     <chr>           <int>   <int>     <int>         <int>   <int>
## 1 "ALASKA"  "on time"         497     221       212           503    1841
## 2 ""        "delayed"          62      12        20           102     305
## 3 ""        ""                 NA      NA        NA            NA      NA
## 4 "AM WEST" "on time"         694    4840       383           320     201
## 5 ""        "delayed"         117     415        65           129      61

Clean, Tidy, and Transform

For the tidy and transform portion, I first removed the row that was originally a blank row in our csv. and then I populated the other cells in the variable X with the appropriate airline. Most the variables were then renamed to provide a better name (X and X.1) or to a better format to transform. I then used pivot_longer to transform the data long, specifically the destinations. This made our data very long but still not in the format needed for analysis, I decided to use pivot_wider to transform the arrivals (on time and delayed) wider. The final tidy dataset is now longer but separated by arrival delays to provide an easier way to do our analysis in the next section.

#Remove row of NAs 
airlines <- airlines[c(1,2,4,5), c(1:7)]

#Populate the other cells with ALASKA and AM WEST
airlines[c(2,4),1] <- airlines[c(1,3),1]

#Rename variables and transform data longer
airlines2<- airlines %>%
  rename(airline = X,
         arrival = X.1,
         "Los Angeles" = "Los.Angeles",
         "San Diego" = "San.Diego",
         "San Francisco" = "San.Francisco") %>%
  pivot_longer(cols=c('Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle'),
               names_to = "destination",
               values_to = "flights")

#Transform the arrival flights to wide
airlines3<- airlines2 %>% 
  pivot_wider(names_from = arrival, 
    values_from = flights)

tibble(airlines3)
## # A tibble: 10 × 4
##    airline destination   `on time` delayed
##    <chr>   <chr>             <int>   <int>
##  1 ALASKA  Los Angeles         497      62
##  2 ALASKA  Phoenix             221      12
##  3 ALASKA  San Diego           212      20
##  4 ALASKA  San Francisco       503     102
##  5 ALASKA  Seattle            1841     305
##  6 AM WEST Los Angeles         694     117
##  7 AM WEST Phoenix            4840     415
##  8 AM WEST San Diego           383      65
##  9 AM WEST San Francisco       320     129
## 10 AM WEST Seattle             201      61

Data Analysis

For the analysis portion, I looked at the percent of delayed flights by destination to see if there was a difference by airline. I created a variable total_flights to sum the flights on time and the flights delayed.Then I created percent_delayed to find the percent of flights delayed by destination and airline. To make it a little more easier to visualize, I created a bar plot of the percent delayed.

We can see that for every destination, the percent of delayed flights are greater for the AM WEST airline. It is especially notable for flights to San Francisco and Seattle. Therefore, to avoid delays a person might choose to fly with ALASKA over AM WEST airline.

#Create a variable of the percent of delayed flights 
airlines4 <- airlines3 %>%
  clean_names() %>%
  mutate(total_flights = on_time + delayed,
         percent_on_time = round(on_time/total_flights*100, 2),
         percent_delayed = round(delayed/total_flights*100, 2))

data.frame(airlines4)
##    airline   destination on_time delayed total_flights percent_on_time
## 1   ALASKA   Los Angeles     497      62           559           88.91
## 2   ALASKA       Phoenix     221      12           233           94.85
## 3   ALASKA     San Diego     212      20           232           91.38
## 4   ALASKA San Francisco     503     102           605           83.14
## 5   ALASKA       Seattle    1841     305          2146           85.79
## 6  AM WEST   Los Angeles     694     117           811           85.57
## 7  AM WEST       Phoenix    4840     415          5255           92.10
## 8  AM WEST     San Diego     383      65           448           85.49
## 9  AM WEST San Francisco     320     129           449           71.27
## 10 AM WEST       Seattle     201      61           262           76.72
##    percent_delayed
## 1            11.09
## 2             5.15
## 3             8.62
## 4            16.86
## 5            14.21
## 6            14.43
## 7             7.90
## 8            14.51
## 9            28.73
## 10           23.28
#Plot the percentage of delayed flights by destination 
ggplot(data=airlines4, aes(x=destination, y=percent_delayed, fill=airline)) +
  ggtitle("Proportion of Delayed Flights by Destination") +
   ylab("Percent of Delayed Flights") + xlab("Destination")+
  scale_fill_brewer(palette = "Set1") +
         geom_col(position = "dodge")

Conclusion

Data tidying is an important step in trying to draw conclusions from our data. In this assignment, we took an untidy dataframe about airlines and transformed it long then wide to produced a dataframe where we can perform analysis.

In our analysis we created new variable to look at percent of delayed flights and then graphed it. From our tidying and analysis of the data, we drew conclusions about which airline may be preferable to avoid delays flying out to LA, Phoenix, San Diego, San Francisco, or Seattle. It is best to flight out of Alaska airline because the percentage of delayed flights are lower than AM WEST for all of these destinations.

It may be useful to look at how the delays differ from flights to the East Coast by airline to see if AM WEST will still have a larger percent of delays compared to ALASKA airline.