Assignment – Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:

 The URL to the .Rmd file in your GitHub repository. and

 The URL for your rpubs.com web page.

library("readxl")
library("tidyr")
library("dplyr")
library("Hmisc")
library("ggplot2")

#Read from csv file

airlinesData_csv <- read.csv("https://raw.githubusercontent.com/theoracley/Data607/master/Assignment5/TwoAirlines.csv", header=TRUE, sep=",")

#airlinesData_csv <- read.csv("TwoAirlines.csv", header=TRUE, sep=",")
#airlinesData_csv

#Find the datatype of AirlinesData_df first
typeof(airlinesData_csv)
## [1] "list"
# then Convert to data frame
AirlinesData_df <- data.frame(airlinesData_csv)

let’s go to work

#Let's look at our data and start cleaning up from here
AirlinesData_df
##         X     X.1 Los.Angeles Phoenix San.Diego San.Franciso Seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2         delayed          62      12        20          102     305
## 3                          NA      NA        NA           NA      NA
## 4 AM WEST on time         694    4840       383          320     201
## 5         delayed         117     415        65          129      61
#Rename the weired column to something meaningful
AirlinesData_df <- AirlinesData_df %>%
  rename(airline = X, status = X.1, los_angeles = Los.Angeles, phoenix = Phoenix, san_diego = San.Diego, san_franciso = San.Franciso, seattle = Seattle)
AirlinesData_df
##   airline  status los_angeles phoenix san_diego san_franciso seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2         delayed          62      12        20          102     305
## 3                          NA      NA        NA           NA      NA
## 4 AM WEST on time         694    4840       383          320     201
## 5         delayed         117     415        65          129      61
#Remove NA rows
AirlinesData_df <- AirlinesData_df[-c(3),] 
AirlinesData_df
##   airline  status los_angeles phoenix san_diego san_franciso seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2         delayed          62      12        20          102     305
## 4 AM WEST on time         694    4840       383          320     201
## 5         delayed         117     415        65          129      61
#Add missing values
AirlinesData_df[2, "airline"] <- "ALASKA"
AirlinesData_df[4, "airline"] <- "AM WEST"

#now we can reset our index
rownames(AirlinesData_df) <- NULL

#Adding factor
AirlinesData_df$airline <- factor(AirlinesData_df$airline)

AirlinesData_df
##   airline  status los_angeles phoenix san_diego san_franciso seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2  ALASKA delayed          62      12        20          102     305
## 3 AM WEST on time         694    4840       383          320     201
## 4 AM WEST delayed         117     415        65          129      61
#What are the column names now
names(AirlinesData_df)
## [1] "airline"      "status"       "los_angeles"  "phoenix"     
## [5] "san_diego"    "san_franciso" "seattle"
#what datatypes do we have now
sapply(AirlinesData_df, typeof)
##      airline       status  los_angeles      phoenix    san_diego 
##    "integer"    "integer"    "integer"    "integer"    "integer" 
## san_franciso      seattle 
##    "integer"    "integer"
#structure of our data
describe(AirlinesData_df)
## AirlinesData_df 
## 
##  7  Variables      4  Observations
## ---------------------------------------------------------------------------
## airline 
##        n  missing distinct 
##        4        0        2 
##                           
## Value       ALASKA AM WEST
## Frequency        2       2
## Proportion     0.5     0.5
## ---------------------------------------------------------------------------
## status 
##        n  missing distinct 
##        4        0        2 
##                           
## Value      delayed on time
## Frequency        2       2
## Proportion     0.5     0.5
## ---------------------------------------------------------------------------
## los_angeles 
##        n  missing distinct     Info     Mean      Gmd 
##        4        0        4        1    342.5    379.3 
##                               
## Value        62  117  497  694
## Frequency     1    1    1    1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## phoenix 
##        n  missing distinct     Info     Mean      Gmd 
##        4        0        4        1     1372     2446 
##                               
## Value        12  221  415 4840
## Frequency     1    1    1    1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## san_diego 
##        n  missing distinct     Info     Mean      Gmd 
##        4        0        4        1      170      206 
##                               
## Value        20   65  212  383
## Frequency     1    1    1    1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## san_franciso 
##        n  missing distinct     Info     Mean      Gmd 
##        4        0        4        1    263.5    232.3 
##                               
## Value       102  129  320  503
## Frequency     1    1    1    1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## seattle 
##        n  missing distinct     Info     Mean      Gmd 
##        4        0        4        1      602    907.3 
##                               
## Value        61  201  305 1841
## Frequency     1    1    1    1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------

Store our cleaned data into a csv file

write.csv(AirlinesData_df, "./cleanAirlines.csv", row.names=FALSE)

#store our data into a temp dataframe
tempAirlines <- AirlinesData_df
tempAirlines
##   airline  status los_angeles phoenix san_diego san_franciso seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2  ALASKA delayed          62      12        20          102     305
## 3 AM WEST on time         694    4840       383          320     201
## 4 AM WEST delayed         117     415        65          129      61

Analyze this!!

#gathering these columns
names(tempAirlines)[3:7]
## [1] "los_angeles"  "phoenix"      "san_diego"    "san_franciso"
## [5] "seattle"
#Apply and view output
airLines_analysis1 <- gather(tempAirlines, "city", "num", names(tempAirlines)[3:7], factor_key=TRUE)
airLines_analysis1
##    airline  status         city  num
## 1   ALASKA on time  los_angeles  497
## 2   ALASKA delayed  los_angeles   62
## 3  AM WEST on time  los_angeles  694
## 4  AM WEST delayed  los_angeles  117
## 5   ALASKA on time      phoenix  221
## 6   ALASKA delayed      phoenix   12
## 7  AM WEST on time      phoenix 4840
## 8  AM WEST delayed      phoenix  415
## 9   ALASKA on time    san_diego  212
## 10  ALASKA delayed    san_diego   20
## 11 AM WEST on time    san_diego  383
## 12 AM WEST delayed    san_diego   65
## 13  ALASKA on time san_franciso  503
## 14  ALASKA delayed san_franciso  102
## 15 AM WEST on time san_franciso  320
## 16 AM WEST delayed san_franciso  129
## 17  ALASKA on time      seattle 1841
## 18  ALASKA delayed      seattle  305
## 19 AM WEST on time      seattle  201
## 20 AM WEST delayed      seattle   61
airLines_analysis1 <-  airLines_analysis1 %>%
                 spread(status, `num`)
airLines_analysis1
##    airline         city delayed on time
## 1   ALASKA  los_angeles      62     497
## 2   ALASKA      phoenix      12     221
## 3   ALASKA    san_diego      20     212
## 4   ALASKA san_franciso     102     503
## 5   ALASKA      seattle     305    1841
## 6  AM WEST  los_angeles     117     694
## 7  AM WEST      phoenix     415    4840
## 8  AM WEST    san_diego      65     383
## 9  AM WEST san_franciso     129     320
## 10 AM WEST      seattle      61     201
airLines_analysis1 <- airLines_analysis1 %>% 
  mutate(total = delayed + `on time`, percent_delayed = delayed/ total, percent_ontime = `on time`/ total) 
airLines_analysis1
##    airline         city delayed on time total percent_delayed
## 1   ALASKA  los_angeles      62     497   559      0.11091234
## 2   ALASKA      phoenix      12     221   233      0.05150215
## 3   ALASKA    san_diego      20     212   232      0.08620690
## 4   ALASKA san_franciso     102     503   605      0.16859504
## 5   ALASKA      seattle     305    1841  2146      0.14212488
## 6  AM WEST  los_angeles     117     694   811      0.14426634
## 7  AM WEST      phoenix     415    4840  5255      0.07897241
## 8  AM WEST    san_diego      65     383   448      0.14508929
## 9  AM WEST san_franciso     129     320   449      0.28730512
## 10 AM WEST      seattle      61     201   262      0.23282443
##    percent_ontime
## 1       0.8890877
## 2       0.9484979
## 3       0.9137931
## 4       0.8314050
## 5       0.8578751
## 6       0.8557337
## 7       0.9210276
## 8       0.8549107
## 9       0.7126949
## 10      0.7671756
ggplot(data=airLines_analysis1, aes(x=airline, y=percent_ontime)) +
  geom_bar(stat="identity", width=0.5, fill="#52823f") +
  xlab("Airlines") + ylab("On Time") +
  ggtitle("On Time")

ggplot(data=airLines_analysis1, aes(x=airline, y=percent_delayed)) +
  geom_bar(stat="identity", width=0.5, fill="#82423f") +
  xlab("Airlines") + ylab("Delays") +
  ggtitle("Delays")

we observe that AM West has more delays and less on time flights than ALASKA.

ggplot(data=airLines_analysis1, aes(x = airline, y=percent_ontime, fill = city)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("On Time") + ggtitle("On Time per Airport")

ggplot(data=airLines_analysis1, aes(x = airline, y=percent_delayed, fill = city)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("Delays") + ggtitle("On Time per Airport")

When it comes to major airports, there are more delays there. Might be in these major airports, AM WEST conducts more flights than in ALASKA. Let’s check our assumption graphically, shall we?

#Apply and view output
airLines_analysis2 <- gather(tempAirlines, "city", "num", names(tempAirlines)[3:7], factor_key=TRUE)
airLines_analysis2
##    airline  status         city  num
## 1   ALASKA on time  los_angeles  497
## 2   ALASKA delayed  los_angeles   62
## 3  AM WEST on time  los_angeles  694
## 4  AM WEST delayed  los_angeles  117
## 5   ALASKA on time      phoenix  221
## 6   ALASKA delayed      phoenix   12
## 7  AM WEST on time      phoenix 4840
## 8  AM WEST delayed      phoenix  415
## 9   ALASKA on time    san_diego  212
## 10  ALASKA delayed    san_diego   20
## 11 AM WEST on time    san_diego  383
## 12 AM WEST delayed    san_diego   65
## 13  ALASKA on time san_franciso  503
## 14  ALASKA delayed san_franciso  102
## 15 AM WEST on time san_franciso  320
## 16 AM WEST delayed san_franciso  129
## 17  ALASKA on time      seattle 1841
## 18  ALASKA delayed      seattle  305
## 19 AM WEST on time      seattle  201
## 20 AM WEST delayed      seattle   61
ggplot(data=airLines_analysis2, aes(x = airline, y=num, fill = city)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("Number of Flights") + ggtitle("Number of Flights per Airport")

We can see clearly that AM West is running few flights in the major airport, but has more delays than ALASKA. So our previous assumption was totally wrong.

Finally, I think ALASKA has a better flight records than AM WEST because of other reasons. Perhaps, ALASKA has a better reservation system, a better flight maintenance, or a better operating hub. We cannot tell from the small dataset we have. We need a big dataset with lot of of observations and lot of columns. For now, our small dataset is not enough to conclude anything, and or derive any interesting insights.