We were given an data set of Arrival Airport delays between some of the big cities. The data set has a wide structure that needs some manipulation to transform it into long form. We load the csv file then rename the empty column name. We deleted all the empty row and reprocess the data inton a long structure.
# read data csv file
raw <- read.csv("https://raw.githubusercontent.com/joewarner89/CUNY-607/main/homeworks/Assignment%205/airport_data.csv", header = T, stringsAsFactors = F)
# Delete all Empty rows
raw <- raw %>% na.omit(raw)
raw <- as.data.frame(raw)
# rename row
raw <- raw %>% rename(Airlines = 1, Arrival_Status = 2)
head(raw)
## Airlines Arrival_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
The layout of this data set require data manipulation to transform this data set in the long form.
## 'data.frame': 20 obs. of 4 variables:
## $ Airlines : chr "Alaska" "" "" "AM WEST" ...
## $ Arrival_Status: chr "on time" "delayed" "on time" "delayed" ...
## $ City : chr "Los.Angeles" "Los.Angeles" "Los.Angeles" "Los.Angeles" ...
## $ Arrival_Delays: int 497 62 694 117 221 12 4840 415 212 20 ...
## Airlines Arrival_Status City Arrival_Delays
## 1 Alaska on time Los.Angeles 497
## 2 delayed Los.Angeles 62
## 3 on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 Alaska on time Phoenix 221
## 6 delayed Phoenix 12
We are going to replace the dot(.) in the City Column with a space.
airline_data$City <- str_replace(airline_data$City, "\\.", " ")
head(airline_data)
## Airlines Arrival_Status City Arrival_Delays
## 1 Alaska on time Los Angeles 497
## 2 delayed Los Angeles 62
## 3 on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on time Phoenix 221
## 6 delayed Phoenix 12
We will fill the value of NA with the most recent value that above the empty strings.
final <- airline_data %>% mutate(Airlines = as.character(na_if(Airlines,""))) %>% fill(Airlines,.direction = 'down')
head(final)
## Airlines Arrival_Status City Arrival_Delays
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 Alaska on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
head(final)
## Airlines Arrival_Status City Arrival_Delays
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 Alaska on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
final %>% group_by(Airlines,City,Arrival_Status) %>% summarise(Delay_total = sum(Arrival_Delays))
## `summarise()` has grouped output by 'Airlines', 'City'. You can override using
## the `.groups` argument.
## # A tibble: 15 × 4
## # Groups: Airlines, City [10]
## Airlines City Arrival_Status Delay_total
## <chr> <chr> <chr> <int>
## 1 AM WEST Los Angeles delayed 117
## 2 AM WEST Phoenix delayed 415
## 3 AM WEST San Diego delayed 65
## 4 AM WEST San Francisco delayed 129
## 5 AM WEST Seattle delayed 61
## 6 Alaska Los Angeles delayed 62
## 7 Alaska Los Angeles on time 1191
## 8 Alaska Phoenix delayed 12
## 9 Alaska Phoenix on time 5061
## 10 Alaska San Diego delayed 20
## 11 Alaska San Diego on time 595
## 12 Alaska San Francisco delayed 102
## 13 Alaska San Francisco on time 823
## 14 Alaska Seattle delayed 305
## 15 Alaska Seattle on time 2042
air <- final %>% filter(City %in% c("Los Angeles","Seattle") )
# Only 2 Airport
air <- final %>% filter(City %in% c("Los Angeles","Seattle") )
head(air)
## Airlines Arrival_Status City Arrival_Delays
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 Alaska on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 Alaska on time Seattle 1841
## 6 Alaska delayed Seattle 305
ggplot(air, aes(x=City, y=Arrival_Delays, fill=Arrival_Status)) +
geom_bar(stat='identity') +
theme_bw()