Data607-Lab5

R Markdown

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

I will begin by creating a data frame which will include the information from the CSV file. We will use the link from the raw github code of the CSV file that we uploaded to our repository which includes all the information in the table. We will then load the libraries necessary to tidy this data which are tidyr and dplyr.

flightscsv1 <- read.csv("https://raw.githubusercontent.com/Zcash95/DATA607-5/main/DATA607airlineschedule%20-%20Sheet1.csv")
head(flightscsv1)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On-time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West On-time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We’ll being by adding names to the columns that have none such as the first and second columns which indicate the airline and the status of the flight. We will add names to the airlines column to indicate which airline the delayed flights are from. We will then proceed by deleting any empty rows and as we can see the entire cells of row 3 are empty so we will just delete that row

flightscsv1 <- flightscsv1 %>%
rename("Airline" = 1, "Status" = 2, "Los Angeles" = 3, "Phoenix" = 4, "San Diego" = 5, "San Francisco" = 6, "Seattle" = 7)

print(flightscsv1)

##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska On-time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West On-time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61

flightscsv1[2, "Airline"] <- "Alaska"
flightscsv1[5, "Airline"] <- "AM West"

flightscsv1 <- flightscsv1 %>%
na.omit(flightscsv)

head(flightscsv1)

##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska On-time         497     221       212           503    1841
## 2  Alaska Delayed          62      12        20           102     305
## 4 AM West On-time         694    4840       383           320     201
## 5 AM West Delayed         117     415        65           129      61

We will then attempt to tidy the data further by putting in a long format which is in rows. This will help us complete the analysis that we need to by putting the data in a longer format of rows instead of spread out by columns

flightscsv_long <- pivot_longer(flightscsv1, cols = -c(Airline, Status), names_to = "Destination", values_to = "Flights")
print(flightscsv_long)

## # A tibble: 20 × 4
##    Airline Status  Destination   Flights
##    <chr>   <chr>   <chr>           <int>
##  1 Alaska  On-time Los Angeles       497
##  2 Alaska  On-time Phoenix           221
##  3 Alaska  On-time San Diego         212
##  4 Alaska  On-time San Francisco     503
##  5 Alaska  On-time Seattle          1841
##  6 Alaska  Delayed Los Angeles        62
##  7 Alaska  Delayed Phoenix            12
##  8 Alaska  Delayed San Diego          20
##  9 Alaska  Delayed San Francisco     102
## 10 Alaska  Delayed Seattle           305
## 11 AM West On-time Los Angeles       694
## 12 AM West On-time Phoenix          4840
## 13 AM West On-time San Diego         383
## 14 AM West On-time San Francisco     320
## 15 AM West On-time Seattle           201
## 16 AM West Delayed Los Angeles       117
## 17 AM West Delayed Phoenix           415
## 18 AM West Delayed San Diego          65
## 19 AM West Delayed San Francisco     129
## 20 AM West Delayed Seattle            61

We analyzed the data by creating new colums names On-time and delayed which counted the flights that were either on time or delayed. We then proceeded to add the total of delayed and on time flights and divided it by the number of delayed and this gave us a proportion indicating that Alaska airlines had more flights on time then AM West.

performance_data <- flightscsv_long %>%
  group_by(Airline, Status) %>%
  summarise(Count = n())

## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.

performance_data <- performance_data %>%
  mutate(
  OnTimeCount = ifelse(Status == "On-time", Count, 0),
  DelayedCount = ifelse(Status == "Delayed", Count, 0)) %>%
  select(Airline, OnTimeCount, DelayedCount)


performance_data <- performance_data %>%
  group_by(Airline) %>%
  summarise( OnTime = sum(OnTimeCount),Delayed = sum(DelayedCount))

performance_data <- performance_data %>%
  mutate(OnTimePercentage = (OnTime / (OnTime + Delayed)) * 100, 
  DelayedPercentage = (Delayed / (OnTime + Delayed)) * 100)

print(performance_data)

## # A tibble: 2 × 5
##   Airline OnTime Delayed OnTimePercentage DelayedPercentage
##   <chr>    <dbl>   <dbl>            <dbl>             <dbl>
## 1 AM West      5       5               50                50
## 2 Alaska       5       5               50                50

The conclusion that we got after analyzing our data is that Alaska airlines had more flights on time then AM West did.

Data607-Lab5

2024-02-20

R Markdown