DATA 607: Tidying Data

Author

Gabriel Castellanos

Published

February 26, 2023

Introduction

The goal of the project is to:

  1. Load data into a data frame from SQL Server database that includes data about arriving flights in the United States. The source code to populate the tables can be found here: Flights DB

  2. Once we load the data from the database into R, tidying/cleaning of the data needs to be performed.

  3. Using data visualizations, use GG Plot to compare the ‘On-Time’ and ‘Delayed’ flights.

First we install the required libraries/packages

package 'odbc' successfully unpacked and MD5 sums checked
package 'RODBC' successfully unpacked and MD5 sums checked
package 'sqldf' successfully unpacked and MD5 sums checked
package 'ggplot2' successfully unpacked and MD5 sums checked
package 'plotly' successfully unpacked and MD5 sums checked
package 'dplyr' successfully unpacked and MD5 sums checked
package 'hrbrthemes' successfully unpacked and MD5 sums checked
Installing viridis [0.6.2] ...
    OK [linked cache]

Now we establish a connection to the database.

con <- odbcDriverConnect(connection = "Driver= {SQL Server Native Client 11.0};server=localhost;database=DBex;trusted_connection=yes")

Next, we query the data that we want:

queryresult <- sqlQuery(con, "select Airlines.Airline_Name, Arriving_Flights.City, Arriving_Flights.Flight_Status, Arriving_Flights.Flight_Time_minuts, Arriving_Flights.ID FROM Arriving_Flights INNER JOIN Airlines ON Arriving_Flights.ID = Airlines.ID;")

Now that we have the data loaded, we can commence the cleaning process.

  • Flight time minutes are mislabeled and should be renamed the number of flights.

  • We also need to convert character variables into factor in order to conduct analysis

  • We also do not want the ID column.

queryresult$ID <- NULL
names(queryresult)[4] ="Number of Flights"
#Data Conversion
queryresult$Airline_Name <- as.factor(queryresult$Airline_Name)
queryresult$Flight_Status <- as.factor(queryresult$Flight_Status)

queryresult$City <- as.factor(queryresult$City)

The data is now said to be ‘tidy’ as it meets each of the following 3 criteria:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table. 1

    • Since the data is now considered tidy, we can now compare the ‘On-Time’ and delayed groups between the two airlines.
Alaska <- subset(queryresult, Airline_Name == 'Alaska')

AM.West <- subset(queryresult, Airline_Name == 'AM West')


Alaska %>%
  ggplot( aes(x=City, y=`Number of Flights`, group=Flight_Status, color=Flight_Status)) +
    geom_line() +
  coord_flip() +
    scale_color_viridis(discrete = TRUE) +
    ggtitle("Alaska Airlines: On-Time vs Delayed") +
    theme_ipsum() +
    ylab("Number of Flights")

AM.West |> 
  ggplot( aes(x=City, y=`Number of Flights`, group=Flight_Status, color=Flight_Status)) +
    geom_line() +
  coord_flip() +
    scale_color_viridis(discrete = TRUE) +
    ggtitle("AM West Airlines: On-Time vs Delayed") +
    theme_ipsum() +
    ylab("Number of Flights")

Number of Flights for Alaska airlines: On-Time vs Delayed

Number of Flights for AM West airlines: On-Time vs Delayed

Flight Status comparison for AM West and Alaska Airlines.

Results

From the graphs we can see that a few takeaways are immediately apparent:

  1. They were a significantly greater amount of flights that were on time rather than delayed– when looking at either airline (not surprising).

  2. For AM West: the number of on-time flights and delayed flights are comparable, except when looking at Phoenix where there is a large number of on-time flights. This could be due to the airline getting a lot of traffic in this area. A similar observation can be seen with Alaska airlines and traveling to Seattle.

  3. Due to a large number of flights going to Seattle (for Alaska) and Phoenix (for AM West), the number of delayed flights that travel to these cities is also fairly large.

Footnotes

  1. Wickham, Hadley. Tidy Data, Journal of Statistical Software, Volume VV, Issue II↩︎