DATA 607 Assignment 5
Untidy Data
Prof. Catlin
Via Github, using the read_csv function from the
tidyverse package, I will be loading data from the article:
“Airlines.csv” recreated from Numbersense by Kaiser Fung, 2013.
The data is in CSV format and I will be loading it into a dataframe.
Additionally, since the data is recreated from a book, this data has
been created for educational purposes and is not real data.
Additionally, the data is stored in a private repository for use.
The data should provide us with the airline, the number of on time
flights, the number of delayed flights, and for each, separated by one
of five cities.
The cities are:
airlines <- read_csv("https://raw.githubusercontent.com/Aetherius-rex/DATA607RexWong/refs/heads/main/Assignment%205/Airlines.csv?token=GHSAT0AAAAAAC544NC4WHK63PBQKQKSQEVEZ6HWG3Q")
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
head(airlines)
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
The data does not have a very friendly column name for some of the
data, so we will rename the columns to make it easier to work
with.
First, we will rename the first column to ‘airline’ and the second
column to ‘status’. This will make it easier to understand what the data
is about.
Third, we will reformat the column names for the cities to make it
easier to work with. We will replace spaces with underscores, and make
the names lowercase for consistency both for the other columns, and with
how the pivoting later will append status into columns with
underscores.
Lastly we will rename status ‘on time’ to ‘on_time’ for consistency with
the column names.
# Rename columns
colnames(airlines) <- c("airline", "status", "los_angeles", "phoenix", "san_diego",
"san_francisco", "seattle")
# Reformat items of status column
airlines$status <- sub("on time", "on_time", airlines$status)
head(airlines)
## # A tibble: 5 × 7
## airline status los_angeles phoenix san_diego san_francisco seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on_time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on_time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
The key issue with this data is its untidy nature. The data for each
of the two airlines is stored in different rows, and the cities are
stored in different columns. This makes it difficult to analyze the data
in its current form as taking a column for a city would not return
specifically on time or delayed flights for a particular airline.
However, what it does do well is that it separates the data for each
city onto each rows, which is a good start, but we can make this
better.
Preferably, we’d want each column to not only represent a city but more
specifically on time or delayed flights respectively, therefore a city
will have two columns. This allows us to filter for on time or delayed
flights for a particular airline and city with ease. Whereas, in its
current state, we have to isolate cells of data based on three
conditions: the rows for an airline, and the row for delayed or on time
flights, and the column for a city. This is not ideal for for
programming as it may not be robust to misspelled or malformed
data.
For example, if the ‘on time’ cell is misspelled as ‘on timee’, then the
program would not be able to find the data for on time flights for a
particular airline and city. It may be easy to say that we can simply
find the pattern to the order the data is kept, in this data set delayed
is always the second row for each airline, but this may not always be
the case for real data. Should a data set be malformed and have data
with no consistent order to it, it would be difficult to filter for the
data we want. Lastly, by designing and implementing a system more robust
to either of these two problems would present a complex solution that
could be easily avoided by simply reformatting the data into a tidy
format.
# Drop third row of airlines
airlines <- airlines[-3, ]
# Impute airline names downwards
airlines <- airlines %>%
fill("airline")
Thanks to tidyr we can easily tidy the data with the
pivot_wider function. By passing the column names that
splits the key or in this case the airlines and what data we want to
pivot to new columns, we can easily reformat the data into a tidy
format.
# Tidy the data with pivot_wider
airlines_tidy <- airlines %>%
pivot_wider(names_from = "status", values_from = c("los_angeles", "phoenix",
"san_diego", "san_francisco", "seattle"))
airlines_tidy
## # A tibble: 2 × 11
## airline los_angeles_on_time los_angeles_delayed phoenix_on_time
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 497 62 221
## 2 AM WEST 694 117 4840
## # ℹ 7 more variables: phoenix_delayed <dbl>, san_diego_on_time <dbl>,
## # san_diego_delayed <dbl>, san_francisco_on_time <dbl>,
## # san_francisco_delayed <dbl>, seattle_on_time <dbl>, seattle_delayed <dbl>
Now that we have tidied the data, we can easily filter for on time or delayed flights for a particular airline and city. This makes it easier to analyze the data and find insights that may not have been possible with the data in its original form. To experiment with this, we will do a few quick subsets and summarize the data to see if we can find any interesting insights. I will demonstrate both querying the new transformed data and querying multiple columns at once to find insights.
# seattle_on_time shown on bar graph in ggplot
ggplot(airlines_tidy, aes(x = airline, y = seattle_on_time, fill = airline, label = seattle_on_time)) +
geom_bar(stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
labs(title = "Number of Flights Made On Time at Seattle by Airlines", x = "Airline",
y = "Number of Flights On Time")
Based strictly on this graph, its clear that United Airlines has more flights made on time at Seattle than American Airlines. However that only tells us part of the story, as we do not know how many flights were made delayed at Seattle by each airline.
# seattle_delayed shown on bar graph in ggplot
ggplot(airlines_tidy, aes(x = airline, y = seattle_delayed, fill = airline, label = seattle_delayed)) +
geom_bar(stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
labs(title = "Number of Flights Made Delayed at Seattle by Airlines", x = "Airline",
y = "Number of Flights Delayed")
Based strictly on this graph, its clear that United Airlines has more flights made delayed at Seattle than American Airlines. However, this is tricky to interpret as if we simply go off this value, its clear that Alaskan has much more delayed flights than AM. To be able to examine how either airline does in general, irrespective of the number of flights made, we can find the ratio of delayed to total flights for each airline at Seattle.
# seattle_ratio shown on bar graph in ggplot
ggplot(airlines_tidy, aes(x = airline, y = seattle_delayed/(seattle_on_time + seattle_delayed),
fill = airline, label = seattle_delayed/(seattle_on_time + seattle_delayed))) +
geom_bar(stat = "identity") + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
labs(title = "Ratio of On Time to Delayed Flights at Seattle by Airlines", x = "Airline",
y = "Ratio of Delayed to Total Flights")
This graph is the culmination of both the previous graphs, it shows the ratio of delayed flights over all flights made at Seattle by either airline. This is a good metric to see how well either airline does in general, irrespective of the number of flights made. Thankfully, the data being handled in a tidy format makes it easy to find these insights and visualize them by simply calling for the columns we want to analyze and summing them or dividing them as needed. With this graph showing overall performance, we can also apply a similar calculation to all 5 cities to see how well either airline does overall, maybe having more flights decreases the delay rate, or maybe the delay rate is consistent across all cities depending on the airline.
Like before, getting the ratios is a simple calculation performed on data for each airline at each city, however, to visualize this data, we will need to store the data in a long format to be able to visualize it. This is because we want to compare the ratio of delayed to total flights for each airline at each city, and stack them on top of each other for a stacked bar graph. This is not possible with the current format of the data, as the data is stored in a wide format where each city is a column, and cannot be addressed simultaneously as the y-axis in a ggplot graph.
# init a new column for the ratio of delayed to total for each airline at each
# city
airlines_tidy <- airlines_tidy %>%
mutate(los_angeles_ratio = los_angeles_delayed/(los_angeles_on_time + los_angeles_delayed),
phoenix_ratio = phoenix_delayed/(phoenix_on_time + phoenix_delayed), san_diego_ratio = san_diego_delayed/(san_diego_on_time +
san_diego_delayed), san_francisco_ratio = san_francisco_delayed/(san_francisco_on_time +
san_francisco_delayed), seattle_ratio = seattle_delayed/(seattle_on_time +
seattle_delayed))
# store as a long format for easier visualization
airlines_long <- airlines_tidy %>%
pivot_longer(cols = c(los_angeles_ratio, phoenix_ratio, san_diego_ratio, san_francisco_ratio,
seattle_ratio), names_to = "city", values_to = "ratio")
# visualize the data
ggplot(airlines_long, aes(x = airline, y = ratio, fill = airline, label = ratio)) +
geom_bar(stat = "identity", colour = "black") + geom_text(size = 2, position = position_stack(vjust = 0.7),
parse = TRUE) + geom_text(size = 3, position = position_stack(vjust = 0.3), check_overlap = TRUE,
aes(label = city)) + labs(title = "Ratio of On Time to Delayed Flights by Airlines at Cities",
x = "Airline", y = "Ratio of Delayed to Total Flights")
Through this graph, we can see that Alaska’s lower delay rate compared to AM is consistent across all cities. By comparing these values, to how many flights are made at each city by each airline, I cannot find a pattern based on overall flights flown that may explain Alaska’s better performance overall. However, this process of manipulating the data is only made possible by the initial efforts to tidy the data. Sure, it was necessary at this step to form a long format to visualize the data, but this is a simple step that can be done for specific analysis and is made possible by the data already being in tidy format. This is the power of tidy data, it allows for easy manipulation and analysis of data that would be difficult or impossible.
Through this process, we have taken untidy data, data that is meant for human consumption and not for processing, and turned it into tidy data, data formatted for manipulation and analysis. This processes has allows us to easy generate visualizations without complex data manipulation, and to also allow us to reformat the data for specific analysis that requires it. Like Hadley Wickham states in his paper on Tidy Data, the purpose of tidy data is its standard way of organizing data, and its easy for manipulation, modeling, and visualization. In the end, I believe that in this assignment, I have demonstrated the process of converting raw data into tidy data and the benefits of doing so.