Tidying the data means giving it a consistent structure. Tidy data is easier to manipulate, analyze, and visualize. A dataset is considered tidy if it follows these three rules:
To demonstrate tidying data, we can work with an example dataset
showing airline arrival delays for America West Airlines and Alaska
Airlines:
When the data is loaded into R however, it looks different from how it does above:
| …1 | …2 | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 210 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
This is because R does not know how to interpret empty headers and missing data, so it creates the column titles ‘…1’ and ‘…2’ and assigns the missing values as NA.
To tidy the data, we can remove the empty row, rename the columns that are titled incorrectly, and replace the incorrect “NA” values with the associated airlines.
airline_data_cleaned <- airline_data |>
remove_empty("rows") |>
rename("Airline" = "...1",
"Type" = "...2") |>
fill(`Airline`)
Then, to make sure that the data can be analyzed, the data should be pivoted longer, so that each row is an observation, depicting flight arrivals and delay by count:
airline_data_cleaned <- airline_data_cleaned |>
pivot_longer( cols = c("Los Angeles",
"Phoenix",
"San Diego",
"San Francisco",
"Seattle"),
names_to= "Cities",
values_to="Count")
To visualize and analyze the data, it may be helpful to have a column that includes the total number of flights, and pivot the data wider:
airline_data_cleaned <- airline_data_cleaned |>
pivot_wider(names_from = Type,
values_from = `Count`) |>
mutate(Total = `on time` + `delayed`) |>
rename("On Time" = "on time",
"Delayed" = "delayed") |>
mutate(Airline = case_when(Airline == "AM WEST"~"America West",
Airline == "ALASKA"~"Alaska"))
| Airline | On Time | Delayed | Total |
|---|---|---|---|
| Alaska | 3274 | 501 | 3775 |
| America West | 6447 | 787 | 7234 |
The table above shows the total numbers of flights, delayed flights, and on-time flights by airline. From this table, it is apparent that America West had more total flights.
This plot shows the flight statuses by airline. From this plot, it seems that America West has a slightly higher rate of delayed flights than Alaska Airlines.
| Airline | Median Number of Delayed Flights | Mean Number of Delayed Flights | IQR of Delayed Flights | Total Delayed Flights | Total Number of Flights |
|---|---|---|---|---|---|
| Alaska | 62 | 100.2 | 82 | 501 | 3775 |
| America West | 117 | 157.4 | 64 | 787 | 7234 |
The table above shows the median, mean, and interquartile range for delayed flights by airline. It also provides the total number of delayed flights, and total flights by airline. America West has higher median and mean values for delayed flights. The interquartile range (IQR) describes the spread of the data, and a high IQR can indicate that the data is more spread out, than a smaller IQR.
summary_ontime <- airline_data_cleaned |>
group_by(Airline) |>
summarise(`Median Number of On-Time Flights` = median(`On Time`), `Mean Number of On-Time Flights`=mean(`On Time`),`IQR of On-Time Flights` = IQR(`On Time`), `Total On-Time Flights Flights` = sum(`On Time`),`Total Number of Flights`=sum(`Total`))
knitr::kable(summary_ontime, format ="markdown")
| Airline | Median Number of On-Time Flights | Mean Number of On-Time Flights | IQR of On-Time Flights | Total On-Time Flights Flights | Total Number of Flights |
|---|---|---|---|---|---|
| Alaska | 497 | 654.8 | 282 | 3274 | 3775 |
| America West | 383 | 1289.4 | 374 | 6447 | 7234 |
The table above shows the median, mean, and interquartile range for on-time flights by airline. It also provides the total number of delayed flights, and total flights by airline. Since America West’s median value for on-time flights is lower than the mean, this indicates the distribution is right-skewed for on-time flights.
The plot above is similar to the Delayed Flights by City Plot, but adds more information through the stacked-bar chart, now incorporating information on airlines. In graph, it shows that most of American West’s delayed flights are in Phoenix, and most of Alaska’s delayed flights are in Seattle.
This plot shows the on-time flights by city and airline. Similar to the plot above, American West has the most on-time flights in Phoenix, and Alaska has the most on-time flights in Seattle.
In the plots above, it was not immediately clear which city has the highest rate of delayed flights by airline. For American West, the city with the highest rate of delayed flights is San Francisco, and for Alaska, the highest rate of delayed flights is San Francisco.