The goal of this assignment is to take some messy (untidy) airport arrival data, clean it, and then analyze it. The hope is that doing so will elucidate a number of insights regarding the arrival patterns of two different airlines.
The data is stored in a csv file here, and is imported as a R data frame in the chunk below:
raw_data <- getURL('https://raw.githubusercontent.com/williamzjasmine/CUNY_SPS_DS/master/DATA_607/Homeworks/HW4/dirty_airport_data.csv')
# File was tab delimited, hence the sep = "\t" option
df <- data.frame(read.csv(text=raw_data, sep = "\t"))
df
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The output above reveals that the data in its current format has a number of issues to be addressed:
X and X.1 labels.NA
values and empty strings).X).. characters in their names need to be replaced by a space
character. Though the . was not present in the city names
in the original csv file, they were added during the import to prevent
there being spaces in the dataframe column names.The following section works to fix each of the above issues in order.
The cell below reaplces the column names X and
X.1 with what they should actually be: airline
and arrival_status:
colnames(df)[1] <- "airline"
colnames(df)[2] <- "arrival_status"
df
## airline arrival_status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
As shown in the output above, these first two columns now have appropriate names.
The code chunk below uses the drop_na() function to
remove the third row entirely, since it contains no useful
information:
df <- drop_na(df)
df
## airline arrival_status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
As is shown in the output above, the once third row containing only
empty and NA values is no longer present in the
dataframe.
The code below fills in the empty values in the airline
column using the fill() function. This function fills a
NA cell with the value of the column immediately above it,
which in this case exactly what this dataframe needs. However, to use
the function those two empty values are first converted to
NA.
df[df==""]<-NA
df <- fill(df, airline)
df
## airline arrival_status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
As shown in the output above, the missing airline values
are now included in the dataframe.
Now that there is no missing data, the next step is to transform the
dataframe into a long format. This can be easily done using the
pivot_longer() function, and specifying in the
names_to column that the city name columns are to be melted
into a single field called city. The values in the cells
are then also melted into a single field called
frequency.
df <- pivot_longer(df, cols = !c(airline, arrival_status),
names_to = 'city', values_to = 'frequency')
head(df)
## # A tibble: 6 × 4
## airline arrival_status city frequency
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
The head() of the dataframe above shows that it has been
successfully converted into a long format.
The last step is to replace the . characters in each of
the city names with spaces. This is done easily with the
str_replace_all() function:
df$city <- str_replace_all(df$city, "\\.", " ")
df
## # A tibble: 20 × 4
## airline arrival_status city frequency
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
And that’s it! The data is now in the desired form: no missing data, and long as opposed to wide.
Now that the data has been cleaned, it can be analyzed. The following
code chunk uses a combination of dplyr functions to
determine the percentage of flights that had late or timely arrivals for
each airline:
airline_df <-
df %>%
group_by(airline) %>%
summarise(
num_flights = sum(frequency),
num_delays = sum(ifelse(arrival_status=='delayed', frequency, 0))
)
airline_df <-
airline_df %>%
mutate(
num_on_time = num_flights - num_delays,
delay_rate = num_delays / num_flights,
on_time_rate = (num_flights - num_delays) / num_flights
) %>%
arrange(delay_rate)
airline_df
## # A tibble: 2 × 6
## airline num_flights num_delays num_on_time delay_rate on_time_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AM WEST 7225 787 6438 0.109 0.891
## 2 ALASKA 3775 501 3274 0.133 0.867
It looks like that while AM WEST (American West Airlines) has more
arrival delays in total, they actually had a slightly better
delay_rate than ALASKA (Alaska Airlines) by a little over
two percentage points: in other words, flights from AM WEST arrived on
time for a higher percentage of their flights.
The following cell performs the same analysis, except now aggregates the information on a city level:
city_df <-
df %>%
group_by(city) %>%
summarise(
num_flights = sum(frequency),
num_delays = sum(ifelse(arrival_status=='delayed', frequency, 0))
)
city_df <-
city_df %>%
mutate(
num_on_time = num_flights - num_delays,
delay_rate = num_delays / num_flights,
on_time_rate = (num_flights - num_delays) / num_flights
) %>%
arrange(delay_rate)
city_df
## # A tibble: 5 × 6
## city num_flights num_delays num_on_time delay_rate on_time_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Phoenix 5488 427 5061 0.0778 0.922
## 2 San Diego 680 85 595 0.125 0.875
## 3 Los Angeles 1370 179 1191 0.131 0.869
## 4 Seattle 2408 366 2042 0.152 0.848
## 5 San Francisco 1054 231 823 0.219 0.781
Looking at the delay_rates for individual cities, its
clear theres actually a pretty big range. Phoenix was the easiest city
to get to for both airlines, with about 92% of their flights arriving on
time. However, San Francisco seemed to pose some trouble, as only 78% of
flights made it there without delay. The following cell zooms in on both
of these cities’s flight info:
df %>%
filter(city == 'San Francisco' | city == 'Phoenix') %>%
group_by(city, airline) %>%
summarise(
num_flights = sum(frequency),
num_delays = sum(ifelse(arrival_status=='delayed', frequency, 0))
) %>%
mutate(
num_on_time = num_flights - num_delays,
delay_rate = num_delays / num_flights,
on_time_rate = (num_flights - num_delays) / num_flights
) %>%
arrange(city, airline)
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 7
## # Groups: city [2]
## city airline num_flights num_delays num_on_time delay_rate on_time_…¹
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Phoenix ALASKA 233 12 221 0.0515 0.948
## 2 Phoenix AM WEST 5255 415 4840 0.0790 0.921
## 3 San Francisco ALASKA 605 102 503 0.169 0.831
## 4 San Francisco AM WEST 449 129 320 0.287 0.713
## # … with abbreviated variable name ¹on_time_rate
The information above shows that while both airlines had pretty close
on_time_rates for Phoenix, AM WEST seems more to blame for
San Franciso’s overall delay_rate: only 71% of AM WEST’s
flights to San Francisco arrived on time, compared to 83% of flights
flown by ALASKA. Given that San Francisco is part of the American West,
you might assume that AM WEST could do better, but maybe that’s why they
went out of business…
The final step is to output the data so that anyone can build upon the analysis done here. This is done in the chunk below:
write.csv(df,"tournamentinfo.csv")