Assignment Tidying and Transforming Data

Author

Rashad Long

The chart below describes arrival delays for two airlines across five destinations.

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

I constructed a (CSV) file replicating the provided table. Subsequently, I uploaded the CSV file to GitHub. Finally, I utilized the read.csv function to import the data from the CSV file.

# Read in CSV file 
my_flights <-
  read.csv(
    "https://raw.githubusercontent.com/RDLong718/DATA607-Spring24/main/DATA607-Spring2024/Assignments/Assignment%20Tidying%20and%20Transforming%20Data/tidy_flights.csv"
  )

my_flights

        X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2         delayed          62      12        20           102     305
3                          NA      NA        NA            NA      NA
4 AM WEST on time         694    4840       383           320     201
5         delayed         117     415        65           129      61

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

The CSV file read in exactly like the table provided. The presence of a null row necessitated its removal, using the na.omit() function.

# Remove null row
my_flights <- na.omit(my_flights)
my_flights

        X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2         delayed          62      12        20           102     305
4 AM WEST on time         694    4840       383           320     201
5         delayed         117     415        65           129      61

I first renamed the columns to make them more descriptive. I then transformed the data to a long format using the pivot_longer function. Since I intended on filling the Airline column down, I inserted NA for the blank values in preparation for using the fill function.(Doesn’t fill blanks, only NA values)

# Clean up Column names then transform data to long format then insert NA in preparation for using the fill down function

my_flights <- my_flights |> # Rename columns
  rename(
    "Airline" = "X",
    "Arrival" = "X.1",
    "Los Angeles" = "Los.Angeles",
    "San Diego" = "San.Diego",
    "San Francisco" = "San.Francisco"
  ) |> # Pivot data to long format
  pivot_longer(
    cols = -c("Airline", "Arrival"),
    names_to = "Destination",
    values_to = "Count"
  ) |> # Insert NA for the blank values in the Airline column
  mutate(Airline = ifelse(Airline == "", NA, Airline)) |> # Fill in the down direction
  fill(Airline)

my_flights

# A tibble: 20 × 4
   Airline Arrival Destination   Count
   <chr>   <chr>   <chr>         <int>
 1 ALASKA  on time Los Angeles     497
 2 ALASKA  on time Phoenix         221
 3 ALASKA  on time San Diego       212
 4 ALASKA  on time San Francisco   503
 5 ALASKA  on time Seattle        1841
 6 ALASKA  delayed Los Angeles      62
 7 ALASKA  delayed Phoenix          12
 8 ALASKA  delayed San Diego        20
 9 ALASKA  delayed San Francisco   102
10 ALASKA  delayed Seattle         305
11 AM WEST on time Los Angeles     694
12 AM WEST on time Phoenix        4840
13 AM WEST on time San Diego       383
14 AM WEST on time San Francisco   320
15 AM WEST on time Seattle         201
16 AM WEST delayed Los Angeles     117
17 AM WEST delayed Phoenix         415
18 AM WEST delayed San Diego        65
19 AM WEST delayed San Francisco   129
20 AM WEST delayed Seattle          61

Perform analysis to compare the arrival delays for the two airlines.

AM WEST had more delays than ALASKA with 787 delayed flights.

By filtering the delayed flights and grouping by Airline I was able to summarize the flights by summing the amount for each Airline. I then built a bar plot to display the results.

# Filter the Delayed flights then group by Airline then summarize the flights by summing the amount of each Airline. Build a bar plot to display the results
my_flights |>
  filter(Arrival == "delayed") |>
  group_by(Airline) |>
  summarise(Flights = sum(Count)) |>
  arrange(desc(Flights)) |>
  ggplot(aes(x = reorder(Airline, -Flights), y = Flights)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Delayed Flights by Airline", x = "Airline", y = "Number of Flights") +
  geom_text(aes(label = Flights), vjust = -.3)

Of those delayed flights AM WEST had the most delays in Phoenix, while ALASKA had the most delays in Seattle.

I shifted analysis from Airline-based grouping to destination-based grouping, enabling visualization of delayed flights by their destination. Bar charts were constructed and placed side by side to facilitate comparative analysis.

my_flights |>
  filter(Arrival == "delayed" , Airline == "AM WEST") |>
  group_by(Destination) |>
  summarise(Flights = sum(Count)) |>
  arrange(desc(Flights)) |>
  ggplot(aes(x = reorder(Destination, -Flights), y = Flights)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Delayed Flights by Destination for AM WEST", x = "Destination", y = "Number of Flights") +
  geom_text(aes(label = Flights), vjust = -.3)
my_flights |>
  filter(Arrival == "delayed" , Airline == "ALASKA") |>
  group_by(Destination) |>
  summarise(Flights = sum(Count)) |>
  arrange(desc(Flights)) |>
  ggplot(aes(x = reorder(Destination, -Flights), y = Flights)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Delayed Flights by Destination for ALASKA", x = "Destination", y = "Number of Flights") +
  geom_text(aes(label = Flights), vjust = -.3)

Looking at the descriptive statistics of the delayed flights for both Airlines, ALASKA has a lower average of delayed flights with 100 compared to AM WEST 157 average delayed flights. AM WEST also has a higher deviation of delayed flights, which indicates that the number of the delayed flights are more variable than ALASKA.

To analyze the descriptive statistics of delayed flights for both airlines, a dedicated dataframe was initially constructed, exclusively containing delayed flights. Subsequently, the describeBy function (psych package) used. The first column was not needed so I eliminated it. Additionally, the first column label was changed to enhance clarity. Finally, the resulting data was transformed into a tibble format to promote readability.

# Summary of Descriptive statistics for the delayed flights grouped by Airline

delayed_flights <- my_flights |> 
  filter(Arrival =="delayed")
delayed_flights

# A tibble: 10 × 4
   Airline Arrival Destination   Count
   <chr>   <chr>   <chr>         <int>
 1 ALASKA  delayed Los Angeles      62
 2 ALASKA  delayed Phoenix          12
 3 ALASKA  delayed San Diego        20
 4 ALASKA  delayed San Francisco   102
 5 ALASKA  delayed Seattle         305
 6 AM WEST delayed Los Angeles     117
 7 AM WEST delayed Phoenix         415
 8 AM WEST delayed San Diego        65
 9 AM WEST delayed San Francisco   129
10 AM WEST delayed Seattle          61

delayed_flights_summary <- describeBy(delayed_flights$Count,group=delayed_flights$Airline,mat=TRUE)
delayed_flights_summary

    item  group1 vars n  mean       sd median trimmed     mad min max range
X11    1  ALASKA    1 5 100.2 120.0175     62   100.2 62.2692  12 305   293
X12    2 AM WEST    1 5 157.4 147.1625    117   157.4 77.0952  61 415   354
         skew  kurtosis       se
X11 0.8482653 -1.203946 53.67346
X12 0.9613965 -1.052983 65.81307

delayed_flights_summary <- delayed_flights_summary[,-1]
colnames(delayed_flights_summary)[1] <- "Airline"
as_tibble(delayed_flights_summary)

# A tibble: 2 × 14
  Airline  vars     n  mean    sd median trimmed   mad   min   max range  skew
  <chr>   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALASKA      1     5  100.  120.     62    100.  62.3    12   305   293 0.848
2 AM WEST     1     5  157.  147.    117    157.  77.1    61   415   354 0.961
# ℹ 2 more variables: kurtosis <dbl>, se <dbl>

Conclusion This assignment provided an opportunity to practice tidying and transforming data using the tidyr and dplyr packages. It also allowed me to explore descriptive statistical summary packages such as psych. The data was transformed from a wide format to a long format, and descriptive statistics were calculated for the delayed flights grouped by Airline. The results were visualized using bar plots to compare the arrival delays for the two airlines and to compare the arrival delays for the two airlines by destination.