Introduction

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.
  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.

Libraries

# Loading the required libraries

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
library(stringr)

Load the .csv file

# Now let's load csv file

airline <- read.csv("airlines.csv", sep=",", header=TRUE) # Due to some problem, I couldn't load the file from github
airline
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Data cleaning process

# Let's rename the variable's names first

airline2 <- rename(airline, airline_name = X, flight_status= X.1)

# Removing the NAs

airline2 <- na.omit(airline2) 

# Filling the missing values

airline2[2,1] <- 'ALASKA'  
airline2[4,1] <- 'AM WEST'

# Now let's merge all the airport's names together and make a single variable with its frequency as second variable

airline3 <- gather(airline2, airport_name, counts, 3:7)

# Now let's get rid of periods in city's name

airline3$airport_name <- str_replace_all(airline3$airport_name, "\\.", " ")

# Now let's spread air flight_status into two columns

airline4 <- airline3 %>% spread(flight_status, counts)

# Let's rename 'on time' as we are going to do some calculations 

airline4 <- rename(airline4, on_time = 'on time')
kable(airline4)
airline_name airport_name delayed on_time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

Data analysis

# Now let's do some calculations for basic data analysis
airline4 <- mutate(airline4, total_flights = delayed + on_time)
airline4 <- mutate(airline4, delayed_ratio = delayed / total_flights)
kable(airline4)
airline_name airport_name delayed on_time total_flights delayed_ratio
ALASKA Los Angeles 62 497 559 0.1109123
ALASKA Phoenix 12 221 233 0.0515021
ALASKA San Diego 20 212 232 0.0862069
ALASKA San Francisco 102 503 605 0.1685950
ALASKA Seattle 305 1841 2146 0.1421249
AM WEST Los Angeles 117 694 811 0.1442663
AM WEST Phoenix 415 4840 5255 0.0789724
AM WEST San Diego 65 383 448 0.1450893
AM WEST San Francisco 129 320 449 0.2873051
AM WEST Seattle 61 201 262 0.2328244
# Now let's calculate the average delays by both airlines but first we have to sum average 

airline5 <- airline4 %>%
  group_by(airline_name) %>%
  summarize(average_delays=mean(delayed_ratio))
kable(airline5)
airline_name average_delays
ALASKA 0.1118683
AM WEST 0.1776915
ggplot(airline4, aes(x=airport_name, y=on_time, group=airline_name, color=airline_name))+geom_line()+labs(x=" City's Name", y="On Time Flights")

# Data visualization for delayed flights
ggplot(airline4, aes(x=airport_name, y=delayed, group=airline_name, color=airline_name))+geom_line()+labs(x="City's Name", y="Delays")

Conclusion

According to the above plot, it is pretty clear that AM WEST has the most number of delays as compared with ALASKA. Hence it is highly recommended to management of AM WEST to work on their delays as it may cause loss to their business. On the other side, ALASKA has some delays too but comparatively their number of delays are lower than AM WEST. If we take a look at first plot, it shows that AM WEST has more number of on-time flights on an average as compared with ALASKA other than in San Francisco and Seattle. For the sake of business recommendation, it seems that overall AM WEST is performing better than ALASKA but it needs to improve its business operations in San Francisco and Seattle where ALASKA is doing better.