Introduction
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. Youâre encouraged to use a âwideâ structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
- Perform analysis to compare the arrival delays for the two airlines.
- Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: ï· The URL to the .Rmd file in your GitHub repository. and ï· The URL for your rpubs.com web page.
Libraries
# Loading the required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
library(stringr)
Load the .csv file
# Now let's load csv file
airline <- read.csv("airlines.csv", sep=",", header=TRUE) # Due to some problem, I couldn't load the file from github
airline
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Data cleaning process
# Let's rename the variable's names first
airline2 <- rename(airline, airline_name = X, flight_status= X.1)
# Removing the NAs
airline2 <- na.omit(airline2)
# Filling the missing values
airline2[2,1] <- 'ALASKA'
airline2[4,1] <- 'AM WEST'
# Now let's merge all the airport's names together and make a single variable with its frequency as second variable
airline3 <- gather(airline2, airport_name, counts, 3:7)
# Now let's get rid of periods in city's name
airline3$airport_name <- str_replace_all(airline3$airport_name, "\\.", " ")
# Now let's spread air flight_status into two columns
airline4 <- airline3 %>% spread(flight_status, counts)
# Let's rename 'on time' as we are going to do some calculations
airline4 <- rename(airline4, on_time = 'on time')
kable(airline4)
ALASKA |
Los Angeles |
62 |
497 |
ALASKA |
Phoenix |
12 |
221 |
ALASKA |
San Diego |
20 |
212 |
ALASKA |
San Francisco |
102 |
503 |
ALASKA |
Seattle |
305 |
1841 |
AM WEST |
Los Angeles |
117 |
694 |
AM WEST |
Phoenix |
415 |
4840 |
AM WEST |
San Diego |
65 |
383 |
AM WEST |
San Francisco |
129 |
320 |
AM WEST |
Seattle |
61 |
201 |
Data analysis
# Now let's do some calculations for basic data analysis
airline4 <- mutate(airline4, total_flights = delayed + on_time)
airline4 <- mutate(airline4, delayed_ratio = delayed / total_flights)
kable(airline4)
ALASKA |
Los Angeles |
62 |
497 |
559 |
0.1109123 |
ALASKA |
Phoenix |
12 |
221 |
233 |
0.0515021 |
ALASKA |
San Diego |
20 |
212 |
232 |
0.0862069 |
ALASKA |
San Francisco |
102 |
503 |
605 |
0.1685950 |
ALASKA |
Seattle |
305 |
1841 |
2146 |
0.1421249 |
AM WEST |
Los Angeles |
117 |
694 |
811 |
0.1442663 |
AM WEST |
Phoenix |
415 |
4840 |
5255 |
0.0789724 |
AM WEST |
San Diego |
65 |
383 |
448 |
0.1450893 |
AM WEST |
San Francisco |
129 |
320 |
449 |
0.2873051 |
AM WEST |
Seattle |
61 |
201 |
262 |
0.2328244 |
# Now let's calculate the average delays by both airlines but first we have to sum average
airline5 <- airline4 %>%
group_by(airline_name) %>%
summarize(average_delays=mean(delayed_ratio))
kable(airline5)
ALASKA |
0.1118683 |
AM WEST |
0.1776915 |
ggplot(airline4, aes(x=airport_name, y=on_time, group=airline_name, color=airline_name))+geom_line()+labs(x=" City's Name", y="On Time Flights")

# Data visualization for delayed flights
ggplot(airline4, aes(x=airport_name, y=delayed, group=airline_name, color=airline_name))+geom_line()+labs(x="City's Name", y="Delays")

Conclusion
According to the above plot, it is pretty clear that AM WEST has the most number of delays as compared with ALASKA. Hence it is highly recommended to management of AM WEST to work on their delays as it may cause loss to their business. On the other side, ALASKA has some delays too but comparatively their number of delays are lower than AM WEST. If we take a look at first plot, it shows that AM WEST has more number of on-time flights on an average as compared with ALASKA other than in San Francisco and Seattle. For the sake of business recommendation, it seems that overall AM WEST is performing better than ALASKA but it needs to improve its business operations in San Francisco and Seattle where ALASKA is doing better.