Data 607 Week 5 Assignment

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

A CSV file is created by arranging the provided data in a table format, with rows representing airlines, their statuses (On Time or Delayed), and arrival delays at different destinations. The resulting CSV file, named “DATA_607_Wk5.csv,” is generated and saved for further analysis

Functions used:

r bind: the rbind function is utilized to row-bind multiple vectors containing airline-related data, such as airline names, statuses, and arrival delays, into a single matrix named csv, representing tabular information.

write.table: he write.table function is used to write data to a CSV file, specifying the data, file name, separator character, and options for excluding column and row names.

# Create a CSV file by inputting the data 
csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
             c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
             c(NA, "Delayed", 62, 12, 20, 102, 305),
             c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
             c(NA, "Delayed", 117, 415, 65, 129, 61))

write.table(csv, file = "DATA_607_Wk5.csv", sep = ",", col.names=F, row.names=F)

Read the information from your .CSV file into R The CSV has been uploaded to github, and has been imported using the github link. The data is then displayed.

# Read the CSV file into R using link
flights_data <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data-607/main/DATA_607_Wk5.csv", stringsAsFactors = FALSE)

# Display 
flights_data

##       NA.   NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2    <NA> Delayed          62      12        20           102     305
## 3 AM WEST On Time         694    4840       383           320     201
## 4    <NA> Delayed         117     415        65           129      61

Continued: use tidyr and dplyr as needed to tidy and transform your data.

The previously created flights_data is transformed into a “long format” using the tidyr and dplyr packages. The data is reshaped to have separate rows for each airline, their statuses, and arrival delays at different destinations, facilitating further analysis and comparisons between the airlines’ performance.

Functions used: mutate: is used to create new columns or modify existing columns in a dataframe. ‘Mutate(Airlines1 = NA., Airlines2=lag(NA.))’: It creates two new columns, Airlines1 and Airlines2, with missing values (NA) and sets Airlines2 to be the lag (previous) value of Airlines1. ‘mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1)’: It creates two more new columns, Airline and Status. Airline is populated with non-missing values from Airlines1 and Airlines2, ensuring that each row now has an airline name. Status is set to the value of the second column in the original data (the “Status” column), which indicates whether the flight was on time or delayed.

gather: gather is used to convert wide data (data with multiple columns) into long data (data with fewer columns but more rows).

select: is used to choose specific columns from a dataframe.

arrange: is used to reorder rows in a dataframe based on specific column values.

# Import packages as necessary 
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Convert to long format 
long_data <- flights_data %>%
mutate(Airlines1 = NA., Airlines2=lag(NA.)) %>% 
mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1)  %>%
gather("Destination", "Flights", 3:7) %>% 
select(Airline:Flights) %>% 
arrange(Airline, desc(Status), Destination); long_data

##    Airline  Status   Destination Flights
## 1   ALASKA On Time   Los.Angeles     497
## 2   ALASKA On Time       Phoenix     221
## 3   ALASKA On Time     San.Diego     212
## 4   ALASKA On Time San.Francisco     503
## 5   ALASKA On Time       Seattle    1841
## 6   ALASKA Delayed   Los.Angeles      62
## 7   ALASKA Delayed       Phoenix      12
## 8   ALASKA Delayed     San.Diego      20
## 9   ALASKA Delayed San.Francisco     102
## 10  ALASKA Delayed       Seattle     305
## 11 AM WEST On Time   Los.Angeles     694
## 12 AM WEST On Time       Phoenix    4840
## 13 AM WEST On Time     San.Diego     383
## 14 AM WEST On Time San.Francisco     320
## 15 AM WEST On Time       Seattle     201
## 16 AM WEST Delayed   Los.Angeles     117
## 17 AM WEST Delayed       Phoenix     415
## 18 AM WEST Delayed     San.Diego      65
## 19 AM WEST Delayed San.Francisco     129
## 20 AM WEST Delayed       Seattle      61

Perform analysis to compare the arrival delays for the two airlines.

To conduct analysis, the mean arrival delays for the two airlines (ALASKA and AM WEST) at various destinations are calculated and visualized to provide insights into their punctuality performance. The mean delay values are presented in a tabular format, and a bar chart is generated to visually compare the arrival delays for each airline at different destinations, facilitating a better understanding of the data set.

Functions: group by function: is used to group rows of a dataframe based on one or more columns. In this case, it’s grouping the data by two columns, “Airline” and “Destination.”

summarize: is used to calculate summary statistics within each group created by ‘group_by’. g gplot function (from the ggplot2 package): ggplot is used for creating data visualizations, such as plots and charts.

library(dplyr)

# Calculate mean delays for each airline and destination
mean_delays <- long_data %>%
  group_by(Airline, Destination) %>%
  summarize(Mean_Delay = mean(Flights))

## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.

# Print the mean_delays data frame
mean_delays

## # A tibble: 10 × 3
## # Groups:   Airline [2]
##    Airline Destination   Mean_Delay
##    <chr>   <chr>              <dbl>
##  1 ALASKA  Los.Angeles         280.
##  2 ALASKA  Phoenix             116.
##  3 ALASKA  San.Diego           116 
##  4 ALASKA  San.Francisco       302.
##  5 ALASKA  Seattle            1073 
##  6 AM WEST Los.Angeles         406.
##  7 AM WEST Phoenix            2628.
##  8 AM WEST San.Diego           224 
##  9 AM WEST San.Francisco       224.
## 10 AM WEST Seattle             131

# Create a bar chart to visualize mean delays of destination and airline
library(ggplot2)

ggplot(mean_delays, aes(x = Destination, y = Mean_Delay, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Mean Arrival Delays by Destination and Airline",
    x = "Destination",
    y = "Mean Delay",
    fill = "Airline"
  ) +
  theme_minimal()

Conclusion: Studying the outputs of the mean, and the visual graph, it can be seen that: ALASKA Airlines generally has lower mean arrival delays compared to AM WEST at most destinations, except for San Francisco, where it experiences a significantly higher mean delay of 302.5 minutes, indicating potential punctuality issues.

AM WEST Airlines generally experiences significantly higher mean arrival delays, notably at Phoenix and San Francisco, with delays of 2627.5 minutes and 224.5 minutes, respectively, although it performs relatively better at Los Angeles, San Diego, and Seattle, still encountering delays.

Data 607 Week 5 Assignment

2023-09-29