Dhananjay Kumar
1. Create a .CSV file that includes the airline information.
2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
3. Perform analysis to compare the arrival delays for the two airlines.
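Load the CSV file and display it to validate the import.
# Packages used in this report
library(knitr); library(tidyr); library(dplyr); library(DT); library(sqldf); library(stringr); library(ggplot2); library(plotly)
untidyData <- read.csv("C:/data/tidyingData.csv", header = TRUE)
kable(untidyData)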
AirLine | Status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | onTime | 497 | 221 | 212 | 503 | 1841 |
ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
AM WEST | onTime | 694 | 4840 | 383 | 320 | 201 |
AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
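Step 1 asks for a .CSV file, but the report does not show how it was created. The following is a minimal sketch of one way to do it, assuming the counts in the table above. The object names `airlines`, `csvPath` and `roundTrip` and the temporary file location are illustrative assumptions, standing in for the real file at C:/data/tidyingData.csv.

```r
# Rebuild the wide table shown above (values copied from that table).
airlines <- data.frame(
  AirLine       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status        = c("onTime", "delayed", "onTime", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

# Assumption: write to a temporary file rather than the path used in the report.
csvPath <- tempfile(fileext = ".csv")
write.csv(airlines, csvPath, row.names = FALSE)

# Read the file back the same way the report does and inspect its structure.
roundTrip <- read.csv(csvPath, header = TRUE)
str(roundTrip)
```

Writing with `row.names = FALSE` keeps the file to the seven columns shown in the table, so no extra index column appears when it is read back.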
Let's tidy the dataset by gathering the destination cities into a single column: **Destination**.
untidyData <- untidyData %>% gather(Destination, n, Los.Angeles:Seattle)
datatable(untidyData)
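In current tidyr releases, `gather()` is superseded by `pivot_longer()`. The sketch below shows what the same reshaping might look like with the newer function. It assumes the counts from the table above; the names `wide` and `long` are illustrative and not part of the original report.

```r
library(tidyr)

# Wide table as read from the CSV (values copied from the table above).
wide <- data.frame(
  AirLine       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status        = c("onTime", "delayed", "onTime", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

# Move the five city columns into a Destination/n pair of columns.
long <- pivot_longer(
  wide,
  cols      = Los.Angeles:Seattle,
  names_to  = "Destination",
  values_to = "n"
)

print(long)
```

The result has one row per airline, status and destination (4 rows x 5 cities = 20 rows). The row order differs from `gather()`, but the content is the same.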
Let's create column names from the categorical variable Status. This makes the dataset tidy: every variable is in its own column and every observation is in its own row.
tidyData <- untidyData %>% spread(Status, n)
datatable(tidyData)
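The counterpart of `spread()` in current tidyr releases is `pivot_wider()`. A minimal sketch of this step, assuming the long-format data produced by the gathering step; the names `long` and `wide` are illustrative:

```r
library(tidyr)

# Long-format data: one row per airline, status and destination.
long <- data.frame(
  AirLine     = rep(c("ALASKA", "AM WEST"), each = 10),
  Status      = rep(rep(c("onTime", "delayed"), each = 5), times = 2),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), times = 4),
  n = c(497, 221, 212, 503, 1841,   # ALASKA onTime
        62, 12, 20, 102, 305,       # ALASKA delayed
        694, 4840, 383, 320, 201,   # AM WEST onTime
        117, 415, 65, 129, 61)      # AM WEST delayed
)

# Turn the two Status values into two columns holding the counts.
wide <- pivot_wider(long, names_from = Status, values_from = n)

print(wide)
```

Each row now describes one airline at one destination (10 rows), with the `onTime` and `delayed` counts side by side.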
Let's add two more columns to the tidy dataset: onTime_Probability and Delayed_Probability.
tidyData$onTime_Probability <- round((tidyData$onTime / (tidyData$delayed + tidyData$onTime)), digits = 3)
tidyData$Delayed_Probability <- round((tidyData$delayed / (tidyData$delayed + tidyData$onTime)), digits = 3)
datatable(tidyData)
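As a worked example of the calculation, ALASKA flights to Los Angeles have 497 on-time and 62 delayed arrivals, so the on-time probability is 497 / (497 + 62) = 497 / 559, which rounds to 0.889. The same two columns could also be added with dplyr's `mutate()`. The sketch below assumes the tidy counts from the previous step; the data frame name `tidy` and the helper column `total` are illustrative and do not appear in the original report.

```r
library(dplyr)

# Tidy counts: one row per airline and destination.
tidy <- data.frame(
  AirLine     = rep(c("ALASKA", "AM WEST"), each = 5),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), times = 2),
  onTime      = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)

# Share of on-time and delayed arrivals per row, rounded to 3 digits.
tidy <- tidy %>%
  mutate(
    total               = onTime + delayed,
    onTime_Probability  = round(onTime / total, digits = 3),
    Delayed_Probability = round(delayed / total, digits = 3)
  )

print(tidy)
```

Computing `total` once keeps the two probability expressions short and guarantees that both use the same denominator.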
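Below, the on-time probability is summarized for each airline.
# summary() of each query result is a table whose cells read "label:value"; data.frame() puts those cells in Freq and Freq.1.
b <- data.frame((summary(sqldf('select onTime_Probability from tidyData where AirLine = "ALASKA"'))), summary(sqldf('select onTime_Probability from tidyData where AirLine = "AM WEST"')) )
# Split each cell at the colon into a label column and a value column.
AL <- str_split_fixed(b$Freq, ":", 2)
AMWest <- str_split_fixed(b$Freq.1, ":", 2)
meanData <- data.frame(AL, AMWest)
# Drop the second copy of the label column (it comes from the AM WEST split).
meanData$X1.1 <- NULL
colnames(meanData) <- c("Function", "Alaska", "AM West")
kable(meanData)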
Function | Alaska | AM West |
---|---|---|
Min. | 0.831 | 0.7130 |
1st Qu. | 0.858 | 0.7670 |
Median | 0.889 | 0.8550 |
Mean | 0.888 | 0.8224 |
3rd Qu. | 0.914 | 0.8560 |
Max. | 0.948 | 0.9210 |
As seen above, Alaska Airlines has better on-time performance than AM West: its mean on-time probability across the five destinations is 0.888, against 0.8224 for AM West.
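The summary table above can also be reproduced without SQL or string splitting. The following is a sketch using only base R, assuming the rounded on-time probabilities derived from the counts in the first table; the names `onTimeProb` and `summaryTable` are illustrative.

```r
# Rounded on-time probabilities per airline, in destination order:
# Los Angeles, Phoenix, San Diego, San Francisco, Seattle.
onTimeProb <- data.frame(
  AirLine            = rep(c("ALASKA", "AM WEST"), each = 5),
  onTime_Probability = c(0.889, 0.948, 0.914, 0.831, 0.858,
                         0.856, 0.921, 0.855, 0.713, 0.767)
)

# Split the probabilities by airline and summarize each group;
# sapply() binds the two six-number summaries into a matrix.
summaryTable <- sapply(
  split(onTimeProb$onTime_Probability, onTimeProb$AirLine),
  summary
)

print(round(summaryTable, 4))
```

This keeps the values numeric, so they can be compared or plotted directly instead of being parsed back out of formatted text.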
The scatter chart below shows that Phoenix has the highest on-time probability for both airlines and San Francisco has the lowest.
bsc <- ggplot(tidyData, aes(x = Destination , y = onTime_Probability)) + geom_point(aes(color = onTime_Probability, size = onTime_Probability, shape = factor(AirLine))) + scale_colour_gradient(low = "purple")
ggplotly(bsc)
The density chart below supports this analysis. Alaska Airlines has better on-time performance at all five destinations.
dPlot <- qplot(onTime_Probability, data=tidyData, geom='density', color=AirLine, xlim =c(0.50, 1)) + facet_grid(Destination ~.)
ggplotly(dPlot)
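The per-destination comparison can also be checked numerically. The sketch below assumes the counts from the first table and computes, for each destination, the gap between the two airlines' on-time probabilities. The data frame `counts` and its column names are illustrative and do not appear in the original report.

```r
# On-time and delayed counts per destination (values from the first table).
counts <- data.frame(
  Destination   = c("Los.Angeles", "Phoenix", "San.Diego",
                    "San.Francisco", "Seattle"),
  alaskaOnTime  = c(497, 221, 212, 503, 1841),
  alaskaDelayed = c(62, 12, 20, 102, 305),
  amWestOnTime  = c(694, 4840, 383, 320, 201),
  amWestDelayed = c(117, 415, 65, 129, 61)
)

# On-time probability per airline, rounded as in the report.
counts$ALASKA  <- round(counts$alaskaOnTime /
                          (counts$alaskaOnTime + counts$alaskaDelayed), 3)
counts$AM.WEST <- round(counts$amWestOnTime /
                          (counts$amWestOnTime + counts$amWestDelayed), 3)

# A positive difference means ALASKA is more punctual at that destination.
counts$Difference <- round(counts$ALASKA - counts$AM.WEST, 3)
counts$Better     <- ifelse(counts$Difference > 0, "ALASKA", "AM WEST")

print(counts[, c("Destination", "ALASKA", "AM.WEST", "Difference", "Better")])
```

With these counts the difference is positive in every row, and it is largest for San Francisco.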