Introduction

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. Youâre encouraged to use a âwideâ structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: ï· The URL to the .Rmd file in your GitHub repository. and ï· The URL for your rpubs.com web page.

Libraries

# Loading the required libraries

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(knitr)
library(stringr)

Load the .csv file

# Now let's load csv file

airline <- read.csv("airlines.csv", sep=",", header=TRUE) # Due to some problem, I couldn't load the file from github
airline

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Data cleaning process

# Let's rename the variable's names first

airline2 <- rename(airline, airline_name = X, flight_status= X.1)

# Removing the NAs

airline2 <- na.omit(airline2) 

# Filling the missing values

airline2[2,1] <- 'ALASKA'  
airline2[4,1] <- 'AM WEST'

# Now let's merge all the airport's names together and make a single variable with its frequency as second variable

airline3 <- gather(airline2, airport_name, counts, 3:7)

# Now let's get rid of periods in city's name

airline3$airport_name <- str_replace_all(airline3$airport_name, "\\.", " ")

# Now let's spread air flight_status into two columns

airline4 <- airline3 %>% spread(flight_status, counts)

# Let's rename 'on time' as we are going to do some calculations 

airline4 <- rename(airline4, on_time = 'on time')
kable(airline4)

airline_name	airport_name	delayed	on_time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

Data analysis

# Now let's do some calculations for basic data analysis
airline4 <- mutate(airline4, total_flights = delayed + on_time)
airline4 <- mutate(airline4, delayed_ratio = delayed / total_flights)
kable(airline4)

airline_name	airport_name	delayed	on_time	total_flights	delayed_ratio
ALASKA	Los Angeles	62	497	559	0.1109123
ALASKA	Phoenix	12	221	233	0.0515021
ALASKA	San Diego	20	212	232	0.0862069
ALASKA	San Francisco	102	503	605	0.1685950
ALASKA	Seattle	305	1841	2146	0.1421249
AM WEST	Los Angeles	117	694	811	0.1442663
AM WEST	Phoenix	415	4840	5255	0.0789724
AM WEST	San Diego	65	383	448	0.1450893
AM WEST	San Francisco	129	320	449	0.2873051
AM WEST	Seattle	61	201	262	0.2328244

# Now let's calculate the average delays by both airlines but first we have to sum average 

airline5 <- airline4 %>%
  group_by(airline_name) %>%
  summarize(average_delays=mean(delayed_ratio))
kable(airline5)

airline_name	average_delays
ALASKA	0.1118683
AM WEST	0.1776915

ggplot(airline4, aes(x=airport_name, y=on_time, group=airline_name, color=airline_name))+geom_line()+labs(x=" City's Name", y="On Time Flights")

# Data visualization for delayed flights
ggplot(airline4, aes(x=airport_name, y=delayed, group=airline_name, color=airline_name))+geom_line()+labs(x="City's Name", y="Delays")

Conclusion

According to the above plot, it is pretty clear that AM WEST has the most number of delays as compared with ALASKA. Hence it is highly recommended to management of AM WEST to work on their delays as it may cause loss to their business. On the other side, ALASKA has some delays too but comparatively their number of delays are lower than AM WEST. If we take a look at first plot, it shows that AM WEST has more number of on-time flights on an average as compared with ALASKA other than in San Francisco and Seattle. For the sake of business recommendation, it seems that overall AM WEST is performing better than ALASKA but it needs to improve its business operations in San Francisco and Seattle where ALASKA is doing better.

Data 607 - Assignment 4 / Working with tidy data

Habib Khan

September 28, 2019