The chart below describes arrival delays for two airlines across five destinations. Your task is to: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information below. You’re encouraged to use a “wide” structure similar to how the information appears below, so that you can practice tidying and transformations. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Perform analysis to compare the arrival delays for the two airlines. Your should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Q1: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Answer: The wide format table has been created and assigned to a variable called flight_info. First of all the rows a created the way they were in the provided image and then using rbind they were bound into a table. After creating the table it was downloaded onto the working directory using write.table function
flight_info <- rbind.data.frame(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
c(NA, "Delayed", 62, 12, 20, 102, 305),
c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
c(NA, "Delayed", 117, 415, 65, 129, 61))
write.table(flight_info, file = "flight_info.csv", sep = ",", col.names=F, row.names=F)
After the table was created and download into the local working directory it was uploaded into a Github repository from where it was loaded into the the Rmarkdown using read.csv() function
flight_info_wide <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/flight_info.csv")
knitr::kable(head(flight_info_wide))
| NA. | NA..1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | On Time | 497 | 221 | 212 | 503 | 1841 |
| NA | Delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | On Time | 694 | 4840 | 383 | 320 | 201 |
| NA | Delayed | 117 | 415 | 65 | 129 | 61 |
Q 2: Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Answer: Now in order to analyze the data, we have to transform it into more analysis friendly structure. In order to achieve that we have to transform and tidy up the data from its original wide format into a long format with the help of functions from dplyr and tidyr. First of we will work on the first two columns and will try to change that using mutate() with coalesce(). After that the data from column 3 to 7 can be gathered into flights and destination using gather(). Above steps will transform the the table into long format. We can select and arrange the information the way we want using select() and arrange().
flight_info_long <- flight_info_wide|>
mutate(alaska=NA., amwest = lag(NA.))|>
mutate(airlines= coalesce(alaska,amwest), status = NA..1)|>
gather("destinations", "flights", 3:7)|>
select(airlines:flights)|>
arrange(airlines, desc(status), destinations);
knitr::kable(head(flight_info_long))
| airlines | status | destinations | flights |
|---|---|---|---|
| ALASKA | On Time | Los.Angeles | 497 |
| ALASKA | On Time | Phoenix | 221 |
| ALASKA | On Time | San.Diego | 212 |
| ALASKA | On Time | San.Francisco | 503 |
| ALASKA | On Time | Seattle | 1841 |
| ALASKA | Delayed | Los.Angeles | 62 |
Now we can see that data has been transformed into the desired analysis friendly format. We can actually add one more step to it and spread the status columns into Delayed and On time columns by using the values from flight column with the help of spread() function.
flight_info_long <- spread(flight_info_long, status, `flights`)
knitr::kable(head(flight_info_long))
| airlines | destinations | Delayed | On Time |
|---|---|---|---|
| ALASKA | Los.Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San.Diego | 20 | 212 |
| ALASKA | San.Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los.Angeles | 117 | 694 |
Q 3: Perform analysis to compare the arrival delays for the two airlines.
Answer: Before jumping into the conclusion by doing an overall comparison of the two Airlines let us first compare them destination wise. In order to achieve that we will go ahead and make a new data frame that will contain columns for the percent of delayed and on time flights alongside total flights for each airlines for every destination which will give us a rough idea of how the airlines are performing based on destinations and the total flights flying to the same destination.
comparison_by_city <- flight_info_long|>
mutate(total_flights= (Delayed + `On Time`), percent_delay= round((Delayed / ( Delayed+ `On Time`))*100,1), percent_ontime= 100-percent_delay)|>
select(airlines,destinations,percent_delay,percent_ontime, total_flights)|>
group_by(airlines)|>
arrange(desc(percent_ontime),.by_group = TRUE)
knitr::kable(comparison_by_city)
| airlines | destinations | percent_delay | percent_ontime | total_flights |
|---|---|---|---|---|
| ALASKA | Phoenix | 5.2 | 94.8 | 233 |
| ALASKA | San.Diego | 8.6 | 91.4 | 232 |
| ALASKA | Los.Angeles | 11.1 | 88.9 | 559 |
| ALASKA | Seattle | 14.2 | 85.8 | 2146 |
| ALASKA | San.Francisco | 16.9 | 83.1 | 605 |
| AM WEST | Phoenix | 7.9 | 92.1 | 5255 |
| AM WEST | Los.Angeles | 14.4 | 85.6 | 811 |
| AM WEST | San.Diego | 14.5 | 85.5 | 448 |
| AM WEST | Seattle | 23.3 | 76.7 | 262 |
| AM WEST | San.Francisco | 28.7 | 71.3 | 449 |
We can see that Alaska airlines has a better percentage of on time arrivals to every single destination but there certain destination in which even though Alaska airlines has a better percentage but the sample size is not that significant compared to AM West. Take the flight to Phoenix for an example, Alaska airlines has an on time arrivals of 94.8% and AM West has 92.1% which suggest to take Alaska airlines when flying to Phoenix but if you compare the sample size so Alaska has only 233 flight compared to 5255 from AM West. Which makes AW West more reliable.
Now lets compare the overall performance of both airlines.At this instance we kind of have a rough idea that since Alaska airlines has a lower arrival delay to every destination so we would expect the same results in overall
overall_comparison <- flight_info_long|>
group_by(airlines)|>
select(airlines,Delayed,`On Time`)|>
summarise(total_delayed=sum(Delayed),total_ontime=sum(`On Time`), delay_percent=round(sum(Delayed)/(sum(Delayed) + sum(`On Time`)) * 100, 2), ontime_percent=round(sum(`On Time`)/(sum(Delayed) + sum(`On Time`)) * 100, 2))
knitr::kable(overall_comparison)
| airlines | total_delayed | total_ontime | delay_percent | ontime_percent |
|---|---|---|---|---|
| ALASKA | 501 | 3274 | 13.27 | 86.73 |
| AM WEST | 787 | 6438 | 10.89 | 89.11 |
But in fact we see an anti climax which is that AM west has lower delay in overall comparison than Alaska airlines. The reason being that a good amount of AM West flights were to Phoenix i.e. 5255 out of 6438, which governs the overall performance when it comes to flight delay and we can clearly see that AM West did well when the destination was Phoenix that is how it ended up over Alaska airlines
We can conclude that if the number of flights flights is low Alaska airlines performs better and if the number of flights is higher then AM West performs better. To support the statement above we can recall the performance by destination, the table was set in descending order of percent on-time arrivals as the number the of flight increases from 233 to 2146 so the delay in arrival increases. Conversely, the AM West performs better when the number of flights are higher i.e. for Phoenix which is 5255 but it soon as the number of flight goes down so did the performance figures. Based on the conclusion derived each airline has an exceptional value the does not follow along the trend with others.