DATA 607: Assignment 5

Introduction

For this assignment we were tasked to work with untidy data. Shown below is a chart that describes arrival delays for two airlines across five destinations. Our first objective is to create a CSV file with this data. Once created, we need to use dplyr and tidyr to tidy and transform the data. Once the data is sufficiently tidied, I will then perform an analysis on the data and describe any findings that I make.

Creating the CSV File

Shown below is the process for creating comma separated values, otherwise known as a CSV. Once created, we can use R to write a CSV file for it that we can then upload elsewhere to be read back into R.

df <- data.frame(X1 = c("Alaska",NA, NA,"AM West",NA),
                 X2 = c("on time","delayed",NA,"on time","delayed"),
                 `Los Angeles` = c(497,62,NA,694,117),
                 Phoenix = as.numeric( gsub(",","",c(221,12,NA,'4,840',415))),
                 `San Diego` = c(212,20,NA,383,65),
                 `San Francisco` = c(503,102,NA,320,129),
                 Seattle =  as.numeric( gsub(",","",c('1,841',305,NA,201,61)))
                 )

write.csv(df,"flight.csv",row.names=FALSE)

flights <- read.csv("https://raw.githubusercontent.com/zachsfr/RTest/main/flight.csv")

Initial Shape of the Data

Seen below is the initial shape of data on load. Unsparingly, it is exactly the same as the original image. With it now loaded into R, we can begin discussing and then fixing the issues with our data

In fixing the data, we need to first recognize what is wrong with it. The original data implicitly labeled the first column, assuming that the reader can figure out that each airline appears twice, once for on time and a second time for delayed. In doing so, they only labeled the on time row with the airline and left the row below empty. In addition, the data uses a blank row to separate each airline.They then decided to split the cities into separate columns. All of these elements combined make the data very hard to use for analysis in R.

Tidying the Data

While the issues mentioned above prevent us from using the data right away, reshaping the data into a more R friendly format is a very simple process. In the first part of the code below, shown as the highlighted portion in the image below, we can use pivot_longer, the updated version of gather(), to pull the city columns, which were all separate, into one singular column. Once we do so, we can then use the fill() command to fill in the missing airline names with those in the previous row.

In the next portion of the code, we further tidy the data by splitting the arrival types, “on time” and “delayed”, into separate columns using the opposite of pivot_longer, pivot_wider. We then use the new columns to create two new variables that can be useful in analyzing the data, the proportion of delayed and on time arrivals. The final result is shown in the table below.

flights <- flights %>%
  pivot_longer(Los.Angeles:Seattle,
               names_to = "City",
               values_to  = "Count",
               values_drop_na=T)  %>%
  fill(X1) %>%
  pivot_wider(names_from =X2,values_from =Count) %>%

  rename(Airline = X1, On.Time = `on time`,Delayed = delayed) %>%
  mutate(delay_prop=  round(((Delayed/(Delayed + On.Time))),  3 ) ,
         on_time_prop =  round(((On.Time/(Delayed + On.Time))),  3 )
         
         )

Analyzing the Data

Part One

In this section, we will analyze the data. With the arrival delay being a numeric value, and the cities that represent them as categorical, we can cleanly visualize and study our data using a barplot with our bars split by airline. The code and resulting barplot can be seen below.

Using the barplot below can gain some significant insight into some key aspects about our data. The first and most striking detail is that the airline AM West has significantly more arrival delays at every location.

We can also see which cities have the most and least arrival delays. It appears that San Francisco has the most and that Phoneix has the last arrival delays.

 flights %>%
  ggplot(aes(x=reorder(City,delay_prop),y=delay_prop,fill=Airline)) + 
  geom_bar(stat = 'identity',position=position_dodge())   +
  geom_text(aes(label=delay_prop), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +

  labs(y = ("Arrival Delay"),x = ("City"),
      title = ("Proportion of Arrivals with Delays per City and Airline")  ) +
  scale_x_discrete(labels = c("Phoenix","San Diego","Los Angeles","Seattle","San Francisco"))+
  scale_fill_brewer(palette="Paired") +
  theme_minimal()

Part Two

When looking at the data from a different angle, we notice something bizarre occurring. In the barplot below, the per airline delayed flight arrival proportions are compared. We noted above that AM West had significantly more arrival delays when looking at arrival delays per city. However, now when we look at the overall arrival delay values for both airline, Alaska appears to have the highest delay value. What is going on?

flights %>%
  
  group_by(Airline) %>%
  summarise (m = round((sum(Delayed)/ sum(Delayed+On.Time)),3 ) )   %>%
  ggplot(aes(x=Airline, y=m ,fill=Airline)) + 
  geom_bar(stat = 'identity') +
  geom_text(aes(label=m ), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +
  labs(y = ("Arrival Delay"),
      title = ("Proportion of Arrivals with Delays per Airline")) +
  scale_fill_brewer(palette="Paired")+
  theme_minimal()

However, the mystery for why this is occurring is solved by looking back at our data. As seen below, in Phoenix, AM West airlines has a total number of flights that is significantly higher than the rest of the data. The overall number of on time and delayed flights in this city far outnumber the other values. When we go to sum the arrival counts for each airline, that massive number skews the final value we get for AM West.

Airline	City	On.Time	Delayed
Alaska	Los.Angeles	497	62
Alaska	Phoenix	221	12
Alaska	San.Diego	212	20
Alaska	San.Francisco	503	102
Alaska	Seattle	1841	305
AM West	Los.Angeles	694	117
AM West	Phoenix	4840	415
AM West	San.Diego	383	65
AM West	San.Francisco	320	129
AM West	Seattle	201	61

As shown in the updated barplot below, if we ignore Phoenix in our data, our expected outcome of AM West having the highest overall arrival delay proportion, comes true.

flights %>%
group_by(Airline) %>%
  filter(City != "Phoenix") %>%
  summarise (m = round((sum(Delayed)/ sum(Delayed+On.Time)),3 ) )   %>%
  ggplot(aes(x=Airline, y=m ,fill=Airline)) + 
  geom_bar(stat = 'identity') +
  geom_text(aes(label=m ), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5)+
  labs(y = ("Arrival Delay"),
      title = ("Proportion of Arrivals with Delays per Airline, Excluding Phoenix")) +
  scale_fill_brewer(palette="Paired")+
  theme_minimal()

Conclusion

This assignment had us work with oddly shaped data that won’t work well with R if left alone. However, as shown here, using dpylr and tidyr makes the process for tidying the data into proper format a simple and efficient matter.

Once we have the data reformatted, we can use ggplot to analyze and better make sense of what is happening in our data. However, as shown in this assignment, we must always be cautions about relying on the initial graphics we make. Extreme values in our data can easily skew the outcome we see and lead to false conclusions. Proper understanding of how the values in our data affect the outcome is essential.