Tidying and Transforming Data

Assignment

1. Create a .CSV file that includes the airline information.

2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

3. Perform analysis to compare the arrival delays for the two airlines.

Load Data

1.1 Load CSV file

Load the CSV file from a local path and validate it by displaying its contents with kable.

library(knitr)  # kable()
untidyData <- read.csv("C:/data/tidyingData.csv", header = TRUE)
kable(untidyData)

AirLine	Status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	onTime	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	onTime	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61
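
For readers without access to C:/data/tidyingData.csv, the table above can be reproduced from inline text. This is a minimal sketch: the header row with spaces in the city names is an assumption about the original file, and csv_text is a hypothetical name. By default read.csv() runs column names through check.names = TRUE, which turns the spaces into dots and would explain names such as Los.Angeles.

```r
# Hypothetical stand-in for the CSV file: same values as the table above.
# The header spelling (spaces in city names) is an assumption.
csv_text <- "AirLine,Status,Los Angeles,Phoenix,San Diego,San Francisco,Seattle
ALASKA,onTime,497,221,212,503,1841
ALASKA,delayed,62,12,20,102,305
AM WEST,onTime,694,4840,383,320,201
AM WEST,delayed,117,415,65,129,61"

# read.csv() accepts literal text through the text argument.
untidyData <- read.csv(text = csv_text, header = TRUE)

# Column names after check.names has replaced the spaces with dots.
names(untidyData)
```

The resulting data frame has four rows (two airlines by two statuses) and seven columns, the same shape as the table above.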

Tidy Data

2.1 Tidy Data - I

Let's tidy the untidy dataset by gathering the destination cities into one column: Destination.

library(tidyr)  # gather(), spread(), %>%
library(DT)     # datatable()
untidyData <- untidyData %>% gather(Destination, n, Los.Angeles:Seattle)
datatable(untidyData)

2.2 Tidy Data - II

Let's create column names from the categorical variable Status. This turns the untidy dataset into a tidy one: every variable gets its own column and every observation its own row.

tidyData <- untidyData %>% spread(Status, n)
datatable(tidyData)
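
The tidyr documentation marks gather() and spread() as superseded by pivot_longer() and pivot_wider(). As a rough sketch of the same two tidying steps with the newer functions, the block below rebuilds the untidy table inline (a stand-in for the CSV file) and reshapes it. The name longData is hypothetical and is not used elsewhere in this report.

```r
library(tidyr)

# Inline copy of the untidy table shown earlier (stand-in for the CSV file).
untidyData <- data.frame(
  AirLine       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status        = c("onTime", "delayed", "onTime", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

# Counterpart of step 2.1: one row per airline, status and destination.
longData <- pivot_longer(untidyData, cols = Los.Angeles:Seattle,
                         names_to = "Destination", values_to = "n")

# Counterpart of step 2.2: one column per status,
# one row per airline and destination.
tidyData <- pivot_wider(longData, names_from = Status, values_from = n)

dim(longData)
dim(tidyData)
```

The long form has 20 rows (4 original rows by 5 cities) and the tidy form has 10 rows (2 airlines by 5 destinations). Row and column order may differ from the gather()/spread() output, but the values are the same.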

Analysis

3.1 Probability

Let's add two more columns to our tidy dataset: onTime_Probability and Delayed_Probability.

tidyData$onTime_Probability <- round((tidyData$onTime / (tidyData$delayed + tidyData$onTime)), digits = 3)
tidyData$Delayed_Probability <- round((tidyData$delayed / (tidyData$delayed + tidyData$onTime)), digits = 3)
datatable(tidyData)
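
As a worked check of the formula, ALASKA flights to Phoenix have 221 on-time and 12 delayed arrivals, so the on-time probability is 221 / (221 + 12) = 221 / 233, which rounds to 0.948. The sketch below shows one way to write the same calculation with dplyr's mutate(). It assumes the tidy table has the columns produced in section 2.2 and rebuilds it inline so the block runs on its own.

```r
library(dplyr)

# Inline copy of the tidy table from section 2.2 (counts from the source data).
tidyData <- data.frame(
  AirLine     = rep(c("ALASKA", "AM WEST"), each = 5),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), times = 2),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  onTime      = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
)

# Share of on-time and delayed flights per airline and destination,
# rounded to three digits as in the report.
tidyData <- tidyData %>%
  mutate(
    onTime_Probability  = round(onTime / (onTime + delayed), 3),
    Delayed_Probability = round(delayed / (onTime + delayed), 3)
  )

tidyData
```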

3.2 Summarize Data

Below, the on-time probabilities are summarized for each airline across the five destinations.

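library(sqldf)    # SQL queries on data frames
library(stringr)  # str_split_fixed()
# summary() of each one-column query result yields strings of the form "name:value"
b <- data.frame(
  summary(sqldf("select onTime_Probability from tidyData where AirLine = 'ALASKA'")),
  summary(sqldf("select onTime_Probability from tidyData where AirLine = 'AM WEST'"))
)
# Split each string at ":" into the statistic name and its value
AL <- str_split_fixed(b$Freq, ":", 2)
AMWest <- str_split_fixed(b$Freq.1, ":", 2)
meanData <- data.frame(AL, AMWest)
meanData$X1.1 <- NULL  # drop the duplicated statistic-name column
colnames(meanData) <- c("Function", "Alaska", "AM West")
kable(meanData)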

Function	Alaska	AM West
Min.	0.831	0.7130
1st Qu.	0.858	0.7670
Median	0.889	0.8550
Mean	0.888	0.8224
3rd Qu.	0.914	0.8560
Max.	0.948	0.9210

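As seen above, Alaska has better on-time performance than AM West on every summary statistic. For example, its mean on-time probability across the five destinations is 0.888, versus 0.8224 for AM West.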
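
The same per-airline summary can also be sketched with dplyr's group_by() and summarise(), which avoids parsing the text output of summary(). This is an alternative under stated assumptions, not the code used for the table above: the tidy table is rebuilt inline, and summaryData and its column names are hypothetical.

```r
library(dplyr)

# Inline copy of the tidy table with the on-time probability from section 3.1.
tidyData <- data.frame(
  AirLine     = rep(c("ALASKA", "AM WEST"), each = 5),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), times = 2),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  onTime      = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
)
tidyData$onTime_Probability <- round(tidyData$onTime / (tidyData$delayed + tidyData$onTime), digits = 3)

# One row per airline; each statistic is taken over the five destinations.
summaryData <- tidyData %>%
  group_by(AirLine) %>%
  summarise(
    Min    = min(onTime_Probability),
    Q1     = quantile(onTime_Probability, 0.25, names = FALSE),
    Median = median(onTime_Probability),
    Mean   = mean(onTime_Probability),
    Q3     = quantile(onTime_Probability, 0.75, names = FALSE),
    Max    = max(onTime_Probability)
  )

summaryData
```

Each statistic here is an unweighted summary of five per-destination probabilities, so a destination with few flights counts as much as one with many.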

Plots

4.1 Scatter Chart

The scatter chart below shows that Phoenix has the highest on-time probability for both airlines (0.948 for Alaska, 0.921 for AM West) and San Francisco has the lowest for both (0.831 for Alaska, 0.713 for AM West).

library(ggplot2)  # ggplot(), qplot()
library(plotly)   # ggplotly()
bsc <- ggplot(tidyData, aes(x = Destination, y = onTime_Probability)) + geom_point(aes(color = onTime_Probability, size = onTime_Probability, shape = factor(AirLine))) + scale_colour_gradient(low = "purple")
ggplotly(bsc)

4.2 Density Chart

The density chart below supports our analysis: Alaska has better on-time performance than AM West at all five destinations.

dPlot <- qplot(onTime_Probability, data=tidyData, geom='density', color=AirLine, xlim =c(0.50, 1)) + facet_grid(Destination ~.)
ggplotly(dPlot)