Github Repository

Introduction to Tidy Data

Tidying the data means giving it a consistent structure. Tidy data is easier to manipulate, analyze, and visualize. A dataset is considered tidy if it follows these three rules:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each value has its own cell


To demonstrate tidying data, we can work with an example dataset showing airline arrival delays for America West Airlines and Alaska Airlines: Airline Data


When the data is loaded into R however, it looks different from how it does above:

…1 …2 Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 210
NA delayed 117 415 65 129 61

This is because R does not know how to interpret empty headers and missing data, so it creates the column titles ‘…1’ and ‘…2’ and assigns the missing values as NA.


To tidy the data, we can remove the empty row, rename the columns that are titled incorrectly, and replace the incorrect “NA” values with the associated airlines.

airline_data_cleaned <- airline_data |>
                            remove_empty("rows") |>
                            rename("Airline" = "...1",
                                   "Type" = "...2") |> 
  fill(`Airline`)

Tidy Data by Pivoting Longer

Then, to make sure that the data can be analyzed, the data should be pivoted longer, so that each row is an observation, depicting flight arrivals and delay by count:

airline_data_cleaned <- airline_data_cleaned |> 
                            pivot_longer( cols = c("Los Angeles",
                                                   "Phoenix", 
                                                   "San Diego",
                                                   "San Francisco",
                                                   "Seattle"),
                                                   names_to= "Cities", 
                                                   values_to="Count") 

Pivot Wider and Calculate Totals

To visualize and analyze the data, it may be helpful to have a column that includes the total number of flights, and pivot the data wider:

airline_data_cleaned <- airline_data_cleaned |>  
pivot_wider(names_from = Type, 
            values_from = `Count`) |> 
  mutate(Total = `on time` + `delayed`) |> 
  rename("On Time" = "on time",
         "Delayed" = "delayed") |> 
  mutate(Airline = case_when(Airline == "AM WEST"~"America West",
                             Airline == "ALASKA"~"Alaska"))

Summaries of Airline Performance

Airline On Time Delayed Total
Alaska 3274 501 3775
America West 6447 787 7234

The table above shows the total numbers of flights, delayed flights, and on-time flights by airline. From this table, it is apparent that America West had more total flights.



This plot shows the flight statuses by airline. From this plot, it seems that America West has a slightly higher rate of delayed flights than Alaska Airlines.


Summary Table of Delayed Flights

Airline Median Number of Delayed Flights Mean Number of Delayed Flights IQR of Delayed Flights Total Delayed Flights Total Number of Flights
Alaska 62 100.2 82 501 3775
America West 117 157.4 64 787 7234

The table above shows the median, mean, and interquartile range for delayed flights by airline. It also provides the total number of delayed flights, and total flights by airline. America West has higher median and mean values for delayed flights. The interquartile range (IQR) describes the spread of the data, and a high IQR can indicate that the data is more spread out, than a smaller IQR.


Summary Table of On-Time Flights

summary_ontime <- airline_data_cleaned |> 
  group_by(Airline) |> 
  summarise(`Median Number of On-Time Flights` = median(`On Time`), `Mean Number of On-Time Flights`=mean(`On Time`),`IQR of On-Time Flights` = IQR(`On Time`), `Total On-Time Flights Flights` = sum(`On Time`),`Total Number of Flights`=sum(`Total`))

knitr::kable(summary_ontime, format ="markdown")
Airline Median Number of On-Time Flights Mean Number of On-Time Flights IQR of On-Time Flights Total On-Time Flights Flights Total Number of Flights
Alaska 497 654.8 282 3274 3775
America West 383 1289.4 374 6447 7234

The table above shows the median, mean, and interquartile range for on-time flights by airline. It also provides the total number of delayed flights, and total flights by airline. Since America West’s median value for on-time flights is lower than the mean, this indicates the distribution is right-skewed for on-time flights.


Delayed Flights by City (Total)

Delayed Flights by City and Airline

The plot above is similar to the Delayed Flights by City Plot, but adds more information through the stacked-bar chart, now incorporating information on airlines. In graph, it shows that most of American West’s delayed flights are in Phoenix, and most of Alaska’s delayed flights are in Seattle.

On-Time Flights by City and Airline

This plot shows the on-time flights by city and airline. Similar to the plot above, American West has the most on-time flights in Phoenix, and Alaska has the most on-time flights in Seattle.

Flight Delay Rate by City and Airline


In the plots above, it was not immediately clear which city has the highest rate of delayed flights by airline. For American West, the city with the highest rate of delayed flights is San Francisco, and for Alaska, the highest rate of delayed flights is San Francisco.