Assignment 5 - Untidy Data

Let’s take a first look at what the data looks like before we clean it up.

untidyData <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%205%20-%20Untidy%20Data/untidyData.csv")

## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`

untidyData

## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 Alaska  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

So it looks like we have a variable in airline, a variable in arrival (on time vs delayed), a variable in what I’m assuming is a count of flights as the cell values under the destination cities and there’s also the variable of destination itself.

At the end I generate a summary count of total flights with delayed arrival for each airline, and print out the tidy dataframe.

untidyData[2,1] <- 'Alaska'
untidyData <- untidyData[-3,]
untidyData[4,1] <- 'AM WEST'

pivotLong <- untidyData %>% pivot_longer(cols = c('Los Angeles','Phoenix','San Diego','San Francisco', 'Seattle'), names_to= 'Destination',values_to = 'Count')

pivotLong <- rename(pivotLong, 'airline' = '...1' )
pivotLong <- rename(pivotLong, 'arrivalCategory' = '...2')

pivotLong

## # A tibble: 20 × 4
##    airline arrivalCategory Destination   Count
##    <chr>   <chr>           <chr>         <dbl>
##  1 Alaska  on time         Los Angeles     497
##  2 Alaska  on time         Phoenix         221
##  3 Alaska  on time         San Diego       212
##  4 Alaska  on time         San Francisco   503
##  5 Alaska  on time         Seattle        1841
##  6 Alaska  delayed         Los Angeles      62
##  7 Alaska  delayed         Phoenix          12
##  8 Alaska  delayed         San Diego        20
##  9 Alaska  delayed         San Francisco   102
## 10 Alaska  delayed         Seattle         305
## 11 AM WEST on time         Los Angeles     694
## 12 AM WEST on time         Phoenix        4840
## 13 AM WEST on time         San Diego       383
## 14 AM WEST on time         San Francisco   320
## 15 AM WEST on time         Seattle         201
## 16 AM WEST delayed         Los Angeles     117
## 17 AM WEST delayed         Phoenix         415
## 18 AM WEST delayed         San Diego        65
## 19 AM WEST delayed         San Francisco   129
## 20 AM WEST delayed         Seattle          61

# names(pivotLong)[1] <- 'airline'
# names(pivotLong)[2] <- 'arrivalCategory'



pivotLong %>% group_by(airline) %>% filter(arrivalCategory == 'delayed') %>% summarize(sum(Count))

## # A tibble: 2 × 2
##   airline `sum(Count)`
##   <chr>          <dbl>
## 1 Alaska           501
## 2 AM WEST          787

For a simple ggplot command, I went ahead and filtered some data to create a new dataframe so that there were no inline computations needed.

delays <- pivotLong %>% filter(arrivalCategory == 'delayed')
delays

## # A tibble: 10 × 4
##    airline arrivalCategory Destination   Count
##    <chr>   <chr>           <chr>         <dbl>
##  1 Alaska  delayed         Los Angeles      62
##  2 Alaska  delayed         Phoenix          12
##  3 Alaska  delayed         San Diego        20
##  4 Alaska  delayed         San Francisco   102
##  5 Alaska  delayed         Seattle         305
##  6 AM WEST delayed         Los Angeles     117
##  7 AM WEST delayed         Phoenix         415
##  8 AM WEST delayed         San Diego        65
##  9 AM WEST delayed         San Francisco   129
## 10 AM WEST delayed         Seattle          61

One could make an off the cuff evaluation and use Alaska for all destinations except Seattle.

ggplot(delays,aes(x=Destination,y=Count)) + geom_col(aes(fill = airline), position = "dodge")

Assignment 5 - Untidy Data

Daniel Craig

2023-02-25