For this week’s assignment, we dove into tidy data. Oftentimes we receive data from a wide range of sources, and most of the time it requires cleaning. Part of cleaning is tidying the data into a more usable structure, where each column represents a variable and each row represents an observation. I will use tidyr and dplyr to tidy and transform data about airline arrival delays and then perform an analysis to draw conclusions from the data.
#Load packages and read in the csv file
library(tidyverse)   # dplyr, tidyr, ggplot2, tibble
library(janitor)     # clean_names()
airlines <- read.csv("C:\\Users\\nakes\\OneDrive\\Documents\\DATA607\\airlines.csv")
tibble(airlines)
## # A tibble: 5 × 7
##   X         X.1       Los.Angeles Phoenix San.Diego San.Francisco Seattle
##   <chr>     <chr>           <int>   <int>     <int>         <int>   <int>
## 1 "ALASKA"  "on time"         497     221       212           503    1841
## 2 ""        "delayed"          62      12        20           102     305
## 3 ""        ""                 NA      NA        NA            NA      NA
## 4 "AM WEST" "on time"         694    4840       383           320     201
## 5 ""        "delayed"         117     415        65           129      61
For the tidy and transform portion, I first removed the row that was originally blank in the csv file and then populated the empty cells in the X column with the appropriate airline. The variables were then renamed, either to give them clearer names (X and X.1) or to put them in a better format for the transformation. I then used pivot_longer to pivot the destination columns into a long format. This made the data very long but still not in the shape needed for analysis, so I used pivot_wider to spread the arrival status (on time and delayed) back into separate columns. The final tidy dataset is longer than the original but separates on-time and delayed counts, which makes the analysis in the next section easier.
#Remove row of NAs
airlines <- airlines[c(1,2,4,5), c(1:7)]
#Populate the other cells with ALASKA and AM WEST
airlines[c(2,4),1] <- airlines[c(1,3),1]
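As a side note, the same two cleanup steps can be sketched with dplyr and tidyr verbs instead of base-R indexing. The snippet below is only an illustration: it assumes it is run on a fresh copy of the raw data frame as returned by read.csv (before the two lines above), and airlines_raw is a hypothetical name used for that copy.
#Hypothetical tidyverse alternative to the base-R cleanup above
airlines_raw <- read.csv("C:\\Users\\nakes\\OneDrive\\Documents\\DATA607\\airlines.csv")
airlines_raw %>%
  mutate(X = na_if(X, "")) %>%     # treat blank airline cells as missing
  filter(X.1 != "") %>%            # drop the blank spacer row
  fill(X, .direction = "down")     # carry ALASKA / AM WEST down to the delayed rows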
#Rename variables and transform data longer
airlines2 <- airlines %>%
  rename(airline = X,
         arrival = X.1,
         "Los Angeles" = "Los.Angeles",
         "San Diego" = "San.Diego",
         "San Francisco" = "San.Francisco") %>%
  pivot_longer(cols = c('Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle'),
               names_to = "destination",
               values_to = "flights")
#Transform the arrival flights to wide
airlines3 <- airlines2 %>%
  pivot_wider(names_from = arrival,
              values_from = flights)
tibble(airlines3)
## # A tibble: 10 × 4
##    airline destination   `on time` delayed
##    <chr>   <chr>             <int>   <int>
##  1 ALASKA  Los Angeles         497      62
##  2 ALASKA  Phoenix             221      12
##  3 ALASKA  San Diego           212      20
##  4 ALASKA  San Francisco       503     102
##  5 ALASKA  Seattle            1841     305
##  6 AM WEST Los Angeles         694     117
##  7 AM WEST Phoenix            4840     415
##  8 AM WEST San Diego           383      65
##  9 AM WEST San Francisco       320     129
## 10 AM WEST Seattle             201      61
For the analysis portion, I looked at the percent of delayed flights by destination to see if there was a difference between the airlines. I created a variable, total_flights, to sum the on-time and delayed flights. Then I created percent_delayed to find the percent of flights delayed for each destination and airline. To make the comparison a little easier to visualize, I created a bar plot of the percent delayed.
We can see that for every destination, the percent of delayed flights is greater for AM WEST. The difference is especially notable for flights to San Francisco and Seattle. Therefore, to avoid delays, a person might choose to fly with ALASKA over AM WEST.
#Create a variable of the percent of delayed flights
airlines4 <- airlines3 %>%
  clean_names() %>%
  mutate(total_flights = on_time + delayed,
         percent_on_time = round(on_time / total_flights * 100, 2),
         percent_delayed = round(delayed / total_flights * 100, 2))
data.frame(airlines4)
##    airline   destination on_time delayed total_flights percent_on_time percent_delayed
## 1   ALASKA   Los Angeles     497      62           559           88.91           11.09
## 2   ALASKA       Phoenix     221      12           233           94.85            5.15
## 3   ALASKA     San Diego     212      20           232           91.38            8.62
## 4   ALASKA San Francisco     503     102           605           83.14           16.86
## 5   ALASKA       Seattle    1841     305          2146           85.79           14.21
## 6  AM WEST   Los Angeles     694     117           811           85.57           14.43
## 7  AM WEST       Phoenix    4840     415          5255           92.10            7.90
## 8  AM WEST     San Diego     383      65           448           85.49           14.51
## 9  AM WEST San Francisco     320     129           449           71.27           28.73
## 10 AM WEST       Seattle     201      61           262           76.72           23.28
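As a quick sanity check on these numbers, ALASKA's flights to Los Angeles total 497 + 62 = 559, and 62/559 ≈ 11.09% delayed, which matches the first row of the table above.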
#Plot the percentage of delayed flights by destination
ggplot(data = airlines4, aes(x = destination, y = percent_delayed, fill = airline)) +
  ggtitle("Proportion of Delayed Flights by Destination") +
  ylab("Percent of Delayed Flights") +
  xlab("Destination") +
  scale_fill_brewer(palette = "Set1") +
  geom_col(position = "dodge")
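As an optional check on the claim that AM WEST's delay rate is higher for every destination, the two airlines' delay percentages can be placed side by side. This is just a sketch using the airlines4 data frame created above; am_west_higher is a hypothetical helper column name.
#Optional check: compare the two airlines' delay rates per destination
airlines4 %>%
  select(airline, destination, percent_delayed) %>%
  pivot_wider(names_from = airline, values_from = percent_delayed) %>%
  mutate(am_west_higher = `AM WEST` > ALASKA)   # TRUE when AM WEST's delay rate is higher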
Data tidying is an important step in drawing conclusions from our data. In this assignment, we took an untidy data frame about airlines and transformed it long and then wide to produce a data frame on which we can perform analysis.
In our analysis we created new variables to look at the percent of delayed flights and then graphed it. From our tidying and analysis of the data, we drew conclusions about which airline may be preferable for avoiding delays when flying to LA, Phoenix, San Diego, San Francisco, or Seattle. It is best to fly with ALASKA because its percentage of delayed flights is lower than AM WEST's for all of these destinations.
It may be useful to look at how delays on flights to East Coast destinations differ by airline, to see whether AM WEST would still have a larger percentage of delays than ALASKA.