Data manipulation is one of the most important part of Data Science. The purpose of this assignment is to perform data manipulation using R packages tidyr and dplyr. Data manipulation involves data rearrangement, manipulation and its analysis to make it ready for applicable model.
We have been provided the data for 2 airlines that describes arrival delays for both of them across five destinations.The task is to create a csv file with the given data and use R libraries tidyr and dplyr as needed to tidy and transform data and then perform analysis to compare the arrival delays for the two airlines.
The R packages used for the solution are as below. * dplyr * tidyr
Using read.csv function we populated flights_df from my github repository https://raw.githubusercontent.com/amit-kapoor/data607/master/week5/flighdetails.csv. We first dropped the blank row from data and then used gather function from tidyr package to gather data in City and Flight Count and then used arrange function from dplyr package by Airline.
Next using spread function from tidyr package, spread the data along arrival to make it wide and then rename columns. Then We used mutate function from dplyr package created new columns Delayed_Perc and OnTime_Perc columns.
We used all these functions to have a final table structure to draw analysis graphs. As graphs needs to be plotted for two different airlines We subset table for two airlines. Alaska and AM West.
# read the data from csv
flights_df <- read.csv("https://raw.githubusercontent.com/amit-kapoor/data607/master/week5/flighdetails.csv")
flights_df
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
We do have NAs in the 3rd row. Since its a small dataset, we can simply delete the NAs from 3rd row.
# drop row having NAs and rename columns
flights_df <- flights_df %>%
drop_na() %>%
rename(Airline=X, Arrival=X.1, "Los Angeles"=Los.Angeles, "San Diego"=San.Diego, "San Francisco"=San.Francisco)
flights_df
## Airline Arrival Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM West on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
From above its visible that the rows with delayed arrival do not have airline populated. Here we know the 2nd row is for Alaska and 4th for AM West.
# Replace the blank value in the 2nd row Airline column with Alaska
flights_df$Airline[2] <- "Alaska"
# Replace the blank value in the 4th row Airline column with AM West
flights_df$Airline[4] <- "AM West"
flights_df
## Airline Arrival Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 4 AM West on time 694 4840 383 320 201
## 5 AM West delayed 117 415 65 129 61
In the next few steps we are going to make data available for analysis using gather, arrange, spread, rename and mutate functions. We will add 2 new columns for %delay and %ontime analysis.
# Gather data in City and Flight Count and then arrange by Airline
flights_df <- flights_df %>%
gather("City", "Flight_Count", 3:7) %>%
arrange(Airline)
flights_df
## Airline Arrival City Flight_Count
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 Alaska on time Phoenix 221
## 4 Alaska delayed Phoenix 12
## 5 Alaska on time San Diego 212
## 6 Alaska delayed San Diego 20
## 7 Alaska on time San Francisco 503
## 8 Alaska delayed San Francisco 102
## 9 Alaska on time Seattle 1841
## 10 Alaska delayed Seattle 305
## 11 AM West on time Los Angeles 694
## 12 AM West delayed Los Angeles 117
## 13 AM West on time Phoenix 4840
## 14 AM West delayed Phoenix 415
## 15 AM West on time San Diego 383
## 16 AM West delayed San Diego 65
## 17 AM West on time San Francisco 320
## 18 AM West delayed San Francisco 129
## 19 AM West on time Seattle 201
## 20 AM West delayed Seattle 61
# spread along arrival to make it wide and then rename columns
flights_df <- flights_df %>%
spread("Arrival", "Flight_Count") %>%
rename(Delayed="delayed", OnTime="on time")
flights_df
## Airline City Delayed OnTime
## 1 Alaska Los Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San Diego 20 212
## 4 Alaska San Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM West Los Angeles 117 694
## 7 AM West Phoenix 415 4840
## 8 AM West San Diego 65 383
## 9 AM West San Francisco 129 320
## 10 AM West Seattle 61 201
# mutating new columns Delayed_Perc and OnTime_Perc
flights_df <- flights_df %>%
mutate(total_count = Delayed+OnTime) %>%
mutate(Delayed_Perc = (Delayed/total_count)*100) %>%
mutate(OnTime_Perc = (OnTime/total_count)*100) %>%
select(1:7, -5) # dropped total_count column
flights_df
## Airline City Delayed OnTime Delayed_Perc OnTime_Perc
## 1 Alaska Los Angeles 62 497 11.091234 88.90877
## 2 Alaska Phoenix 12 221 5.150215 94.84979
## 3 Alaska San Diego 20 212 8.620690 91.37931
## 4 Alaska San Francisco 102 503 16.859504 83.14050
## 5 Alaska Seattle 305 1841 14.212488 85.78751
## 6 AM West Los Angeles 117 694 14.426634 85.57337
## 7 AM West Phoenix 415 4840 7.897241 92.10276
## 8 AM West San Diego 65 383 14.508929 85.49107
## 9 AM West San Francisco 129 320 28.730512 71.26949
## 10 AM West Seattle 61 201 23.282443 76.71756
For Data analysis, subsetting the data for both airlines Alaska and AM West for further analysis.
# create dataframe for Alaska
alaska_df <- subset(flights_df, flights_df$Airline == "Alaska")
# create dataframe for AM West
amwest_df <- subset(flights_df, flights_df$Airline == "AM West")
# summary for alaska airline data
summary(alaska_df)
## Airline City Delayed OnTime
## :0 Length:5 Min. : 12.0 Min. : 212.0
## Alaska :5 Class :character 1st Qu.: 20.0 1st Qu.: 221.0
## AM West:0 Mode :character Median : 62.0 Median : 497.0
## Mean :100.2 Mean : 654.8
## 3rd Qu.:102.0 3rd Qu.: 503.0
## Max. :305.0 Max. :1841.0
## Delayed_Perc OnTime_Perc
## Min. : 5.150 Min. :83.14
## 1st Qu.: 8.621 1st Qu.:85.79
## Median :11.091 Median :88.91
## Mean :11.187 Mean :88.81
## 3rd Qu.:14.212 3rd Qu.:91.38
## Max. :16.860 Max. :94.85
# summary for am west airline data
summary(amwest_df)
## Airline City Delayed OnTime
## :0 Length:5 Min. : 61.0 Min. : 201
## Alaska :0 Class :character 1st Qu.: 65.0 1st Qu.: 320
## AM West:5 Mode :character Median :117.0 Median : 383
## Mean :157.4 Mean :1288
## 3rd Qu.:129.0 3rd Qu.: 694
## Max. :415.0 Max. :4840
## Delayed_Perc OnTime_Perc
## Min. : 7.897 Min. :71.27
## 1st Qu.:14.427 1st Qu.:76.72
## Median :14.509 Median :85.49
## Mean :17.769 Mean :82.23
## 3rd Qu.:23.282 3rd Qu.:85.57
## Max. :28.731 Max. :92.10
Will all the data munging performed above, we are ready to draw data analysis now for Delayed arrival by seeing side by side comparison for both the airlines. First we created matrix for delayed percentage data for both the airlines and then draw graphs to further analyze.
# create matrix with Delayed_Perc by Airline
arrival_delay <- matrix(c(alaska_df$Delayed_Perc,
amwest_df$Delayed_Perc), nrow = 2, ncol = 5, byrow = T)
# rename columns as City Names
colnames(arrival_delay) <- amwest_df$City
arrival_delay
## Los Angeles Phoenix San Diego San Francisco Seattle
## [1,] 11.09123 5.150215 8.62069 16.85950 14.21249
## [2,] 14.42663 7.897241 14.50893 28.73051 23.28244
Using barplot function bar graph is plotted for side by side comparison for Arrival Delay of both airlines.
# Draw barplot
barplot(arrival_delay,
main= "Airlines Comparison for Arrival Delay",
beside=TRUE,
legend.text=TRUE,
col = c("grey", "darkblue"),
xlab = "Cities",
ylab = "% Arrival Delay")
legend("topleft", legend=c("Alaska", "AM West"), fill = c("grey", "darkblue"))
Using boxplot function bar graph for Arrival Delay of both airlines, we can see the delayed percentage distribution of airlines.
boxplot(alaska_df$Delayed_Perc,
amwest_df$Delayed_Perc,
names = c("Alaska", "AM West"),
ylab="Delayed Percentage")
Here is an additional analysis for OnTime arrical for both the airlines. Similar seps has been performed to create matrix for on time arrival data and then drawn the barplot and boxplot.
# create matrix with OnTime_Perc by Airline
arrival_ontime <- matrix(c(alaska_df$OnTime_Perc,
amwest_df$OnTime_Perc), nrow = 2, ncol = 5, byrow = T)
# rename columns as City Names
colnames(arrival_ontime) <- amwest_df$City
# Draw barplot
barplot(arrival_ontime,
main= "Airlines Comparison for Arrival OnTime",
beside=TRUE,
legend.text=TRUE,
col = c("grey", "darkblue"),
xlab = "Cities",
ylab = "% Arrival OnTime")
legend("topright", legend=c("Alaska", "AM West"), fill = c("grey", "darkblue"))
# boxplot for ontime arrival data
boxplot(alaska_df$OnTime_Perc,
amwest_df$OnTime_Perc,
names = c("Alaska", "AM West"),
ylab="On Time Percentage")
By analyzing the data for both the Alaska & AM West airlines, it is clear that the mean % arrival delay for Alsaka airline (62) is less that of AM West airline (117). Also from the Bar chart drawn above we can conclude that AM West airline has more % delays as compared to Alaska airline for all the cities. San Francisco has highest % delay for both the airlines.
For Ontime arrival, we can see Alaska airlines has more % arrival ontime in all the cities.