The chart above describes arrival delays for two airlines across five destinations. Your task is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.
You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:

The URL to the .Rmd file in your GitHub repository and The URL for your rpubs.com web page.

PROCEDURE

Library definition

# Need to employ kable
library(knitr)
# Need to employ stringr for Regular Expressions
library(stringr)
# Need to employ to use tidy data functions
library(tidyr)
library(dplyr)

(1) Create .CSV file

I have created a .CSV file named: “Villalobos-airlines.csv”

$Villalobos-airlines.csv$

Villalobos-airlines.csv

(2) Read information from .CSV file into R.

For simplicity and reproducibility reasons, I have posted this file on my GitHub repository as follows:

GitHub URL

url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework5/Villalobos-airlines.csv"

Read .csv from url by employing read.csv()

my.data <- read.csv(url, header=FALSE, sep=",", stringsAsFactors=FALSE)
my.data <- data.frame(my.data)

Imported file display

##        V1      V2          V3      V4        V5            V6     V7
## 1                 Los Angeles Phoenix San Diego San Francisco Seatle
## 2  ALASKA on time         497     221       212           503   1841
## 3         delayed          62      12        20           102    305
## 4                                                                   
## 5 AM WEST on time         694    4840       383           320    201
## 6         delayed         117     415        65           129     61

Data transformation

Renaming Column headers

# Adding "Missing" titles from original file onto the Row #1
my.data$V1[1] <- "Airline"
my.data$V2[1] <- "Status"

# Assigning all the values from the row #1 as the Column Headers
names(my.data) <- my.data[1,]

# Need to eliminate Row #1 in order to keep data consistency.
my.data <- my.data[-c(1), ]

Table displaying correct column titles.

	Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seatle
2	ALASKA	on time	497	221	212	503	1841
3		delayed	62	12	20	102	305
4
5	AM WEST	on time	694	4840	383	320	201
6		delayed	117	415	65	129	61

Eliminating Empty rows with “NA” values by employing drop_na() from the tidy library.

For this, I have to transform our data as follows:

## 'data.frame':    5 obs. of  7 variables:
##  $ Airline      : chr  "ALASKA" "" "" "AM WEST" ...
##  $ Status       : chr  "on time" "delayed" "" "on time" ...
##  $ Los Angeles  : chr  "497" "62" "" "694" ...
##  $ Phoenix      : chr  "221" "12" "" "4840" ...
##  $ San Diego    : chr  "212" "20" "" "383" ...
##  $ San Francisco: chr  "503" "102" "" "320" ...
##  $ Seatle       : chr  "1841" "305" "" "201" ...

Procedure to transform values into integers

for (i in 3:dim(my.data)[2]){
  my.data[,i] <- as.integer(my.data[,i])
}

Preview of data after transformation

## 'data.frame':    5 obs. of  7 variables:
##  $ Airline      : chr  "ALASKA" "" "" "AM WEST" ...
##  $ Status       : chr  "on time" "delayed" "" "on time" ...
##  $ Los Angeles  : int  497 62 NA 694 117
##  $ Phoenix      : int  221 12 NA 4840 415
##  $ San Diego    : int  212 20 NA 383 65
##  $ San Francisco: int  503 102 NA 320 129
##  $ Seatle       : int  1841 305 NA 201 61

Procedure to eliminate all the NA lines from our original file by employing drop_na()

my.data <- my.data %>% drop_na()

	Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seatle
2	ALASKA	on time	497	221	212	503	1841
3		delayed	62	12	20	102	305
5	AM WEST	on time	694	4840	383	320	201
6		delayed	117	415	65	129	61

Adding missing Airline name to “delayed” row

for (i in 1:dim(my.data)[1]){
  if (i %% 2 == 0){
      my.data$Airline[i] <-  my.data$Airline[i-1]
  }
}

Final completed table in order to start employing tidy transformations for further analysis.

	Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seatle
2	ALASKA	on time	497	221	212	503	1841
3	ALASKA	delayed	62	12	20	102	305
5	AM WEST	on time	694	4840	383	320	201
6	AM WEST	delayed	117	415	65	129	61

(3) Analysis

First: we need to transform our table by employing gather() from tidyr library.

# Tidy table by having 4 variables (Airline, Status, City, number of flights)
my.tidy.data <- my.data %>% gather("City","n flights", 3:7)

Airline	Status	City	n flights
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12

Second: Now, I will separate the values “on time” and “delayed” from the Status column into two different columns by employing the spread() function from tidyr library.

Please note that these values can be considered as two different variables.

my.tidy.data <-  my.tidy.data %>%
                 spread(Status, `n flights`)

Final Tidy Table

Airline	City	delayed	on time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seatle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seatle	61	201

a) Total of flights sorted ascending.

Airline	City	delayed	on time	Total
ALASKA	San Diego	20	212	232
ALASKA	Phoenix	12	221	233
AM WEST	Seatle	61	201	262
AM WEST	San Diego	65	383	448
AM WEST	San Francisco	129	320	449
ALASKA	Los Angeles	62	497	559
ALASKA	San Francisco	102	503	605
AM WEST	Los Angeles	117	694	811
ALASKA	Seatle	305	1841	2146
AM WEST	Phoenix	415	4840	5255

b) Ratio of “delayed” flights vs “on time” flights sorted ascending.

Airline	City	delayed	on time	Total	ratio
ALASKA	Phoenix	12	221	233	0.0542986
AM WEST	Phoenix	415	4840	5255	0.0857438
ALASKA	San Diego	20	212	232	0.0943396
ALASKA	Los Angeles	62	497	559	0.1247485
ALASKA	Seatle	305	1841	2146	0.1656708
AM WEST	Los Angeles	117	694	811	0.1685879
AM WEST	San Diego	65	383	448	0.1697128
ALASKA	San Francisco	102	503	605	0.2027833
AM WEST	Seatle	61	201	262	0.3034826
AM WEST	San Francisco	129	320	449	0.4031250

Noticed how the order changed!

c) Total of flights by Airline sorted ascending.

Airline	delayed	on time	Total
ALASKA	501	3274	3775
AM WEST	787	6438	7225

d) Total of flights by City sorted ascending.

City	delayed	on time	Total
San Diego	85	595	680
San Francisco	231	823	1054
Los Angeles	179	1191	1370
Seatle	366	2042	2408
Phoenix	427	5061	5488

Some Horizontal Probabilities

From the previous “Final Tidy Table” we can find some horizontal probabilities, those can be found as follows:

Airline	City	delayed	on time	Total
ALASKA	San Diego	20	212	232
ALASKA	Phoenix	12	221	233
AM WEST	Seatle	61	201	262
AM WEST	San Diego	65	383	448
AM WEST	San Francisco	129	320	449
ALASKA	Los Angeles	62	497	559
ALASKA	San Francisco	102	503	605
AM WEST	Los Angeles	117	694	811
ALASKA	Seatle	305	1841	2146
AM WEST	Phoenix	415	4840	5255

a) Horizontal probabilities for “delayed” and “on time” flights by Airline and City.

Airline	City	delayed	on time	Total	P(delayed)	P(on time)
ALASKA	San Diego	20	212	232	0.0862069	0.9137931
ALASKA	Phoenix	12	221	233	0.0515021	0.9484979
AM WEST	Seatle	61	201	262	0.2328244	0.7671756
AM WEST	San Diego	65	383	448	0.1450893	0.8549107
AM WEST	San Francisco	129	320	449	0.2873051	0.7126949
ALASKA	Los Angeles	62	497	559	0.1109123	0.8890877
ALASKA	San Francisco	102	503	605	0.1685950	0.8314050
AM WEST	Los Angeles	117	694	811	0.1442663	0.8557337
ALASKA	Seatle	305	1841	2146	0.1421249	0.8578751
AM WEST	Phoenix	415	4840	5255	0.0789724	0.9210276

b) Horizontal probabilities for “delayed” and “on time” flights by Airline only.

Airline	delayed	on time	Total
ALASKA	501	3274	3775
AM WEST	787	6438	7225

Airline	delayed	on time	Total	P(A delayed)	P(A on time)
ALASKA	501	3274	3775	0.1327152	0.8672848
AM WEST	787	6438	7225	0.1089273	0.8910727

c) Horizontal probability for “delayed” and “on time” flights by City only.

City	delayed	on time	Total
San Diego	85	595	680
San Francisco	231	823	1054
Los Angeles	179	1191	1370
Seatle	366	2042	2408
Phoenix	427	5061	5488

City	delayed	on time	Total	P(C delayed)	P(C on time)
San Diego	85	595	680	0.1250000	0.8750000
San Francisco	231	823	1054	0.2191651	0.7808349
Los Angeles	179	1191	1370	0.1306569	0.8693431
Seatle	366	2042	2408	0.1519934	0.8480066
Phoenix	427	5061	5488	0.0778061	0.9221939

Joining tables with horizontal probabilities

For this, I will join the Final Tidy Table with the respective probabilities tables in order to create a Final Horizontal Probability Table by employing inner_join() from the dplyr library.

Resulting Horizontal Probability table:

Airline	City	P(A delayed)	P(A on time)	P(C delayed)	P(C on time)
ALASKA	Los Angeles	0.1327152	0.8672848	0.1306569	0.8693431
ALASKA	Phoenix	0.1327152	0.8672848	0.0778061	0.9221939
ALASKA	San Diego	0.1327152	0.8672848	0.1250000	0.8750000
ALASKA	San Francisco	0.1327152	0.8672848	0.2191651	0.7808349
ALASKA	Seatle	0.1327152	0.8672848	0.1519934	0.8480066
AM WEST	Los Angeles	0.1089273	0.8910727	0.1306569	0.8693431
AM WEST	Phoenix	0.1089273	0.8910727	0.0778061	0.9221939
AM WEST	San Diego	0.1089273	0.8910727	0.1250000	0.8750000
AM WEST	San Francisco	0.1089273	0.8910727	0.2191651	0.7808349
AM WEST	Seatle	0.1089273	0.8910727	0.1519934	0.8480066

By comparing with the below table, we noticed that the values are different. That is that so far I have calculated only horizontal probabilities and further analysis can be performed.

Airline	City	delayed	on time	Total	P(delayed)	P(on time)
ALASKA	San Diego	20	212	232	0.0862069	0.9137931
ALASKA	Phoenix	12	221	233	0.0515021	0.9484979
AM WEST	Seatle	61	201	262	0.2328244	0.7671756
AM WEST	San Diego	65	383	448	0.1450893	0.8549107
AM WEST	San Francisco	129	320	449	0.2873051	0.7126949
ALASKA	Los Angeles	62	497	559	0.1109123	0.8890877
ALASKA	San Francisco	102	503	605	0.1685950	0.8314050
AM WEST	Los Angeles	117	694	811	0.1442663	0.8557337
ALASKA	Seatle	305	1841	2146	0.1421249	0.8578751
AM WEST	Phoenix	415	4840	5255	0.0789724	0.9210276

Other probabilities

Other probabilities that can be found will be Vertical Probabilities or/and Total Probabilities by taking into consideration the total flights (“delayed” + “on time”).

Once those probabilities are found we can then start answering questions like:

What’s the probability that a randonmly selected flight will be delayed?
What’s the probability that a randonly selected flight will be from Alaska Airlines and it’s destination will be Seatle?
What’s the probability that out of 5 randonmly selected flights will be 2 delayed and 3 on time?