## Introduction
This notebook cleans and organizes flight information for AM West and Alaska airlines. The data shows the number of on-time and delayed flights at 5 major airports. The time frame and whether the flights are inbound or outbound were not provided.
Load the data into a data frame for modeling and visualization. A small amount of cleaning is needed before visualization.
library(RCurl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- getURL('https://raw.githubusercontent.com/KevinJpotter/data_607/master/data/flights%20sample%20-%20data%20606%20-%20Sheet1.csv')
df <- read.csv(text = data)
# rename the two unnamed columns
df <- rename(df, airline = X, status = X.1)
# drop the blank spacer row (the only row with missing values)
df <- na.omit(df)
# fill in the missing airline names; rows are addressed by position,
# so position 4 is the row labelled 5 in the output below
df[2, 1] <- 'ALASKA'
df[4, 1] <- 'AM WEST'
# convert numeric data
df$Phoenix <- as.integer(df$Phoenix)
df$Seattle <- as.integer(df$Seattle)
# look at data
head(df)
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61
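The airline names above are filled in by hard-coded row position, which works for four rows but would not scale. As a minimal sketch of an alternative, the blank names can be filled from the last non-blank value above them. The `raw` data frame below is a hypothetical stand-in for the sheet (only the Seattle counts are copied from the output above), and it assumes the blank airline cells are read in as empty strings; `raw`, `clean`, and `seen` are illustrative names that do not appear in the original notebook.

```r
# Hypothetical stand-in for the raw sheet: the airline name appears only on
# the first row of each pair, and a blank spacer row separates the airlines.
raw <- data.frame(
  airline = c("ALASKA", "", "", "AM WEST", ""),
  status  = c("on time", "delayed", "", "on time", "delayed"),
  Seattle = c(1841, 305, NA, 201, 61),
  stringsAsFactors = FALSE
)

# Drop the spacer row (the only row with a missing count)
clean <- na.omit(raw)

# Treat empty strings as missing, then carry the last seen name forward.
# This assumes the first row always has a name.
clean$airline[clean$airline %in% ""] <- NA
seen <- !is.na(clean$airline)
clean$airline <- clean$airline[seen][cumsum(seen)]

print(clean)
```

The `cumsum(seen)` index counts how many names have appeared so far, so each blank row picks up the most recent name without relying on a fixed row number.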
Create bar plots to compare delayed and on-time flights across the airports and airlines.
delayed <- subset(df, status == 'delayed')[,3:7]
barplot(t(delayed),
main = 'Delayed Flights',
ylab = 'Flights',
xlab = 'Airline',
names.arg = c('ALASKA', 'AM West'),
legend = c('LA', "Phoenix", 'San Diego',"SF", "Seattle"),
beside = TRUE,
col = rainbow(5)
)
on_time <- subset(df, status == 'on time')[,3:7]
barplot(t(on_time),
main = 'On Time Flights',
ylab = 'Flights',
xlab = 'Airline',
names.arg = c('ALASKA', 'AM West'),
legend = c('LA', "Phoenix", 'San Diego',"SF", "Seattle"),
beside = TRUE,
col = rainbow(5)
)
am_west <- subset(df, airline == 'AM WEST')[,3:7]
barplot(t(am_west),
main = 'AM West Flights',
ylab = 'Flights',
xlab = 'Status',
names.arg = c('On Time', 'Delayed'),
legend = c('LA', "Phoenix", 'San Diego',"SF", "Seattle"),
beside = TRUE,
col = rainbow(5)
)
alaska <- subset(df, airline == 'ALASKA')[,3:7]
barplot(t(alaska),
main = 'Alaska Flights',
ylab = 'Flights',
xlab = 'Status',
names.arg = c('On Time', 'Delayed'),
legend = c('LA', "Phoenix", 'San Diego',"SF", "Seattle"),
beside = TRUE,
col = rainbow(5)
)
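The four plots above repeat the same `barplot` call with different subsets and labels. One way to reduce the repetition, sketched here under assumptions, is a small helper that takes the bar-group labels from a column of the data and lets `barplot` build the legend from the matrix row names (`legend.text = TRUE`). The helper name `plot_counts` and its arguments are hypothetical, and the data frame is rebuilt by hand from the `head(df)` output so the example runs without the download step.

```r
# Counts copied from the head(df) output above
df <- data.frame(
  airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# plot_counts is a hypothetical helper, not part of the original notebook.
# rows:      subset of df to plot
# group_col: column whose values label the groups of bars
plot_counts <- function(rows, group_col, main, xlab) {
  counts <- t(as.matrix(rows[, 3:7]))    # airports as rows, groups as columns
  colnames(counts) <- rows[[group_col]]  # group labels come from the data
  barplot(counts, main = main, ylab = "Flights", xlab = xlab,
          legend.text = TRUE,            # legend taken from the row names
          beside = TRUE, col = rainbow(nrow(counts)))
  invisible(counts)
}

delayed_counts <- plot_counts(subset(df, status == "delayed"),
                              "airline", "Delayed Flights", "Airline")
alaska_counts <- plot_counts(subset(df, airline == "ALASKA"),
                             "status", "Alaska Flights", "Status")
```

With this approach the legend shows the column names as they appear in the data (for example `Los.Angeles`) rather than hand-typed abbreviations, which trades a little polish for labels that cannot drift out of sync with the columns.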
### Further Exploratory Data Analysis
Look at the delayed flights at each airport as a % of each airline's total flights to draw further conclusions.
# delayed flights per airport as a % of all Alaska flights
print(alaska[2, ] / sum(alaska) * 100)
##   Los.Angeles   Phoenix San.Diego San.Francisco Seattle
## 2    1.642384 0.3178808 0.5298013      2.701987 8.07947
# delayed flights per airport as a % of all AM West flights
print(am_west[2, ] / sum(am_west) * 100)
##   Los.Angeles  Phoenix San.Diego San.Francisco   Seattle
## 5    1.619377 5.743945  0.899654      1.785467 0.8442907
After looking over the data, it appears each airline has problems with delayed flights at different airports. Alaska has its largest % of delayed flights in Seattle, whereas AM West has its largest % of delayed flights in Phoenix. These are also the airports where each airline operates the most flights. This leads me to believe the biggest cause of delay is the airline turning planes around and getting them ready, rather than the airport operating inefficiently.
LA and SF, which I would assume to be the larger airports, seem to have a slightly higher % of delays even though neither airline operates a large number of flights there, which could mean the issue lies with the airport.
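The percentages above divide each airport's delayed flights by the airline's total flights across all airports, so airports with more traffic naturally show larger values. A complementary check, sketched below, divides the delayed flights at each airport by that airline's flights at the same airport. The counts are copied from the `head(df)` output; the function name `delay_rate` and the matrix layout are assumptions made for this example and are not part of the original notebook.

```r
airports <- c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle")

# Counts copied from the head(df) output: one row per status, one column per airport
alaska <- rbind("on time" = c(497, 221, 212, 503, 1841),
                "delayed" = c(62, 12, 20, 102, 305))
am_west <- rbind("on time" = c(694, 4840, 383, 320, 201),
                 "delayed" = c(117, 415, 65, 129, 61))
colnames(alaska) <- colnames(am_west) <- airports

# Hypothetical helper: delayed flights at each airport as a % of
# all of that airline's flights at the same airport
delay_rate <- function(counts) {
  counts["delayed", ] / colSums(counts) * 100
}

rates <- rbind("ALASKA" = delay_rate(alaska), "AM WEST" = delay_rate(am_west))
print(round(rates, 1))
```

This per-airport rate answers a different question from the share-of-total figures, so the two sets of numbers are not directly comparable. Reading them side by side may help separate the effect of flight volume from the effect of the airport itself.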