Let’s take a first look at what the data looks like before we clean it up.
untidyData <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%205%20-%20Untidy%20Data/untidyData.csv")
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
untidyData
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
So it looks like we have a variable in airline, a variable in arrival (on time vs delayed), a variable in what I’m assuming is a count of flights as the cell values under the destination cities and there’s also the variable of destination itself.
At the end I generate a summary count of total flights with delayed arrival for each airline, and print out the tidy dataframe.
untidyData[2,1] <- 'Alaska'
untidyData <- untidyData[-3,]
untidyData[4,1] <- 'AM WEST'
pivotLong <- untidyData %>% pivot_longer(cols = c('Los Angeles','Phoenix','San Diego','San Francisco', 'Seattle'), names_to= 'Destination',values_to = 'Count')
pivotLong <- rename(pivotLong, 'airline' = '...1' )
pivotLong <- rename(pivotLong, 'arrivalCategory' = '...2')
pivotLong
## # A tibble: 20 × 4
## airline arrivalCategory Destination Count
## <chr> <chr> <chr> <dbl>
## 1 Alaska on time Los Angeles 497
## 2 Alaska on time Phoenix 221
## 3 Alaska on time San Diego 212
## 4 Alaska on time San Francisco 503
## 5 Alaska on time Seattle 1841
## 6 Alaska delayed Los Angeles 62
## 7 Alaska delayed Phoenix 12
## 8 Alaska delayed San Diego 20
## 9 Alaska delayed San Francisco 102
## 10 Alaska delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
# names(pivotLong)[1] <- 'airline'
# names(pivotLong)[2] <- 'arrivalCategory'
pivotLong %>% group_by(airline) %>% filter(arrivalCategory == 'delayed') %>% summarize(sum(Count))
## # A tibble: 2 × 2
## airline `sum(Count)`
## <chr> <dbl>
## 1 Alaska 501
## 2 AM WEST 787
For a simple ggplot command, I went ahead and filtered some data to create a new dataframe so that there were no inline computations needed.
delays <- pivotLong %>% filter(arrivalCategory == 'delayed')
delays
## # A tibble: 10 × 4
## airline arrivalCategory Destination Count
## <chr> <chr> <chr> <dbl>
## 1 Alaska delayed Los Angeles 62
## 2 Alaska delayed Phoenix 12
## 3 Alaska delayed San Diego 20
## 4 Alaska delayed San Francisco 102
## 5 Alaska delayed Seattle 305
## 6 AM WEST delayed Los Angeles 117
## 7 AM WEST delayed Phoenix 415
## 8 AM WEST delayed San Diego 65
## 9 AM WEST delayed San Francisco 129
## 10 AM WEST delayed Seattle 61
One could make an off the cuff evaluation and use Alaska for all destinations except Seattle.
ggplot(delays,aes(x=Destination,y=Count)) + geom_col(aes(fill = airline), position = "dodge")