DATA 607 - Homework Assignment # 4

Vladimir Nimchenko

1.Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
# Manually created a CSV from the chart, uploaded it to GitHub, and read it into a data frame called "flight_schedule"
flight_schedule <- read.csv("https://raw.githubusercontent.com/GitHub-Vlad/Data-Science/main/flight_arrivals.csv", header = TRUE)

2.Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

The file needs the following changes to be complete:

  1. Add “Airline” as the first column header and “Status” as the second column header.
  2. Add “ALASKA” to the second row and “AM WEST” to the fifth row (the delayed-status row of each airline).
  3. Delete the third row because it has no data and serves no purpose.

# Add "Airline" as the first column header and "Status" as the second column header
colnames(flight_schedule)[1:2] <- c("Airline", "Status")

# Add "ALASKA" to the Airline column in the second row of the "flight_schedule" data frame
flight_schedule[2, 1] <- "ALASKA"

# Add "AM WEST" to the Airline column in the fifth row of the "flight_schedule" data frame
flight_schedule[5, 1] <- "AM WEST"

# Remove the third row, which has no data (NA values)
flight_schedule <- na.omit(flight_schedule)

# Print the data after it has been transformed
head(flight_schedule)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61
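
The same cleanup can also be sketched with tidyr and dplyr verbs, which additionally reshapes the table from "wide" to "long". This is only an illustration under stated assumptions: the `raw` data frame below is a hand-built stand-in for the CSV, the placeholder header names `X` and `X.1` are assumed, and the blank cells are assumed to have been read as NA (for example with `na.strings = ""` in `read.csv`).

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the raw CSV: blank airline cells and an
# empty separator row, with the counts taken from the table above.
raw <- data.frame(
  X             = c("ALASKA", NA, NA, "AM WEST", NA),
  X.1           = c("on time", "delayed", NA, "on time", "delayed"),
  Los.Angeles   = c(497, 62, NA, 694, 117),
  Phoenix       = c(221, 12, NA, 4840, 415),
  San.Diego     = c(212, 20, NA, 383, 65),
  San.Francisco = c(503, 102, NA, 320, 129),
  Seattle       = c(1841, 305, NA, 201, 61)
)

flights_long <- raw %>%
  rename(Airline = X, Status = X.1) %>%   # name the first two columns
  filter(!is.na(Status)) %>%              # drop the empty separator row
  fill(Airline) %>%                       # carry each airline name down one row
  pivot_longer(Los.Angeles:Seattle,       # one row per airline/status/city
               names_to = "City", values_to = "Flights")

print(flights_long)
```

In the long form each row holds a single count, so later summaries can be written with `group_by()` and `summarise()` instead of row and column indices.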
3.Perform analysis to compare the arrival delays for the two airlines.

First, to compare arrival delays, I will calculate the total and the average number of flights per city for each row. The delayed rows then give the “total delay” and “average delay” for ALASKA and AM WEST.

# Use rowSums to add up the flight counts in the five city columns (3 to 7) for every row, and store each row's sum in a new column called "total"
flight_schedule$total <- rowSums(flight_schedule[, 3:7])

# Use rowMeans to average the flight counts in the same five city columns for every row, and store each row's average in a new column called "mean"
flight_schedule$mean <- rowMeans(flight_schedule[, 3:7])

# Print the mean for each row (ALASKA on time, ALASKA delayed, AM WEST on time, AM WEST delayed)
head(flight_schedule$mean)
## [1]  654.8  100.2 1287.6  157.4

Comparing the delayed rows (the second and fourth values), AM WEST has more delayed arrivals than ALASKA: a mean of 157.4 per city (787 in total) against 100.2 per city (501 in total).
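
To make that comparison easier to read, the delayed rows can be pulled out on their own before summing. The following is a minimal sketch that rebuilds the cleaned table by hand from the output above; the names `city_cols`, `delayed`, and `delayed_summary` are introduced here for illustration only.

```r
# Cleaned table, typed in from the printed output above
flight_schedule <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status        = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

# Refer to the city columns by name rather than by position
city_cols <- c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle")

# Keep only the delayed rows, then total and average them per airline
delayed <- flight_schedule[flight_schedule$Status == "delayed", ]
delayed_summary <- data.frame(
  Airline = delayed$Airline,
  total   = rowSums(delayed[, city_cols]),
  mean    = rowMeans(delayed[, city_cols])
)

print(delayed_summary)
```

Selecting columns by name keeps the calculation correct even if extra columns such as "total" or "mean" are later added to the data frame.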

Now that we know which airline has more delayed arrivals, it is useful to find out which city had the most delays and which city had the fewest.

# Find the smallest count in each row across the city columns (3 to 7) and store it in a column called "min_delay"
flight_schedule$min_delay <- apply(flight_schedule[, 3:7], 1, FUN = min)

# Print the minimum count for each row
print(flight_schedule$min_delay)
## [1] 212  12 201  61
# Find the largest count in each row across the city columns (3 to 7) and store it in a column called "max_delay"
flight_schedule$max_delay <- apply(flight_schedule[, 3:7], 1, FUN = max)

# Print the maximum count for each row
print(flight_schedule$max_delay)
## [1] 1841  305 4840  415

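From the delayed rows above (the second and fourth values), we can see that Phoenix had both the fewest delayed flights (12, on ALASKA) and the most (415, on AM WEST). The largest value overall, 4840, is the AM WEST on-time count at Phoenix, not a delay count.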
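
The `min` and `max` calls return the counts but not the city names. One way to recover the names, sketched here with base R's `which.min` and `which.max` on a hand-built copy of the delayed rows (the names `delayed`, `city_cols`, `min_city`, and `max_city` are illustrative):

```r
# Delayed rows only, typed in from the table above
delayed <- data.frame(
  Airline       = c("ALASKA", "AM WEST"),
  Los.Angeles   = c(62, 117),
  Phoenix       = c(12, 415),
  San.Diego     = c(20, 65),
  San.Francisco = c(102, 129),
  Seattle       = c(305, 61)
)

city_cols <- c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle")
counts <- as.matrix(delayed[, city_cols])

# which.min / which.max give the column position of the extreme value
# in each row; indexing city_cols turns that position into a city name
min_city <- city_cols[apply(counts, 1, which.min)]
max_city <- city_cols[apply(counts, 1, which.max)]

print(data.frame(Airline = delayed$Airline, min_city, max_city))
```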

Knowing the total numbers of delayed and on-time flights, it would be interesting to estimate which airline has the bigger chance of being delayed. This information is useful because it can help the flight crew allocate resources to the airline with the higher delay rate and make the adjustments needed for its flights to arrive on time.

# Total number of flights for "ALASKA", both on time and delayed (column 8 is "total")
flight_total_AL <- flight_schedule[1, 8] + flight_schedule[2, 8]

# Total number of flights for "AM WEST", both on time and delayed
flight_total_AM <- flight_schedule[3, 8] + flight_schedule[4, 8]

# Proportion of "delayed" flights for "ALASKA"
flight_delay_AL <- flight_schedule[2, 8] / flight_total_AL

# Proportion of "delayed" flights for "AM WEST"
flight_delay_AM <- flight_schedule[4, 8] / flight_total_AM

# Print the proportion of delayed flights for AM WEST
print(flight_delay_AM)
## [1] 0.1089273
# Print the proportion of delayed flights for ALASKA
print(flight_delay_AL)
## [1] 0.1327152

From the above analysis, the chance of delay is around 13.3% for ALASKA and around 10.9% for AM WEST, so ALASKA has the higher overall chance of delay. This suggests that more resources should be spent on ALASKA to bring its percentage of delayed flights down.
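
The same proportions can be computed without row and column indices. This is a minimal sketch using dplyr, assuming a small hand-built data frame `flight_totals` (a name introduced here) that holds the row totals for each airline and status, summed from the table above.

```r
library(dplyr)

# Row totals per airline and status, summed from the table above
flight_totals <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status  = c("on time", "delayed", "on time", "delayed"),
  total   = c(3274, 501, 6438, 787)
)

# Delay rate = delayed flights / all flights, computed within each airline
delay_rates <- flight_totals %>%
  group_by(Airline) %>%
  summarise(delay_rate = sum(total[Status == "delayed"]) / sum(total))

print(delay_rates)
```

Grouping by airline means the calculation does not depend on the order of the rows or on the position of the "total" column.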

Next, I create a bar plot of the ALASKA delays to display the number of delayed flights in each city.

# Create vectors for the cities (x-axis) and the number of delayed flights in each (y-axis)
x <- c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")
y <- c(62, 12, 20, 102, 305)

# Draw a bar plot of the ALASKA delays; the plot title is set with the "main" argument
barplot(y, main = "Total Delays by City", xlab = "City", ylab = "Total Delays",
        names.arg = x)
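
Since ggplot2 is already loaded at the top of the report, the same chart could also be drawn with it. This is a minimal sketch, assuming a small data frame `alaska_delays` (a name introduced here) built from the same ALASKA delay counts.

```r
library(ggplot2)

# ALASKA delayed-flight counts per city, from the table above
alaska_delays <- data.frame(
  City    = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
  Delayed = c(62, 12, 20, 102, 305)
)

# geom_col() draws one bar per row, with bar height taken from Delayed
p <- ggplot(alaska_delays, aes(x = City, y = Delayed)) +
  geom_col() +
  labs(title = "Total Delays by City", x = "City", y = "Total Delays")

print(p)
```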