The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository. and
The URL for your rpubs.com web page.
library("readxl")
library("tidyr")
library("dplyr")
library("Hmisc")
library("ggplot2")
#Read from csv file
airlinesData_csv <- read.csv("https://raw.githubusercontent.com/theoracley/Data607/master/Assignment5/TwoAirlines.csv", header=TRUE, sep=",")
#airlinesData_csv <- read.csv("TwoAirlines.csv", header=TRUE, sep=",")
#airlinesData_csv
#Find the datatype of AirlinesData_df first
typeof(airlinesData_csv)
## [1] "list"
# then Convert to data frame
AirlinesData_df <- data.frame(airlinesData_csv)
#Let's look at our data and start cleaning up from here
AirlinesData_df
## X X.1 Los.Angeles Phoenix San.Diego San.Franciso Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Rename the weired column to something meaningful
AirlinesData_df <- AirlinesData_df %>%
rename(airline = X, status = X.1, los_angeles = Los.Angeles, phoenix = Phoenix, san_diego = San.Diego, san_franciso = San.Franciso, seattle = Seattle)
AirlinesData_df
## airline status los_angeles phoenix san_diego san_franciso seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Remove NA rows
AirlinesData_df <- AirlinesData_df[-c(3),]
AirlinesData_df
## airline status los_angeles phoenix san_diego san_franciso seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Add missing values
AirlinesData_df[2, "airline"] <- "ALASKA"
AirlinesData_df[4, "airline"] <- "AM WEST"
#now we can reset our index
rownames(AirlinesData_df) <- NULL
#Adding factor
AirlinesData_df$airline <- factor(AirlinesData_df$airline)
AirlinesData_df
## airline status los_angeles phoenix san_diego san_franciso seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
#What are the column names now
names(AirlinesData_df)
## [1] "airline" "status" "los_angeles" "phoenix"
## [5] "san_diego" "san_franciso" "seattle"
#what datatypes do we have now
sapply(AirlinesData_df, typeof)
## airline status los_angeles phoenix san_diego
## "integer" "integer" "integer" "integer" "integer"
## san_franciso seattle
## "integer" "integer"
#structure of our data
describe(AirlinesData_df)
## AirlinesData_df
##
## 7 Variables 4 Observations
## ---------------------------------------------------------------------------
## airline
## n missing distinct
## 4 0 2
##
## Value ALASKA AM WEST
## Frequency 2 2
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## status
## n missing distinct
## 4 0 2
##
## Value delayed on time
## Frequency 2 2
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## los_angeles
## n missing distinct Info Mean Gmd
## 4 0 4 1 342.5 379.3
##
## Value 62 117 497 694
## Frequency 1 1 1 1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## phoenix
## n missing distinct Info Mean Gmd
## 4 0 4 1 1372 2446
##
## Value 12 221 415 4840
## Frequency 1 1 1 1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## san_diego
## n missing distinct Info Mean Gmd
## 4 0 4 1 170 206
##
## Value 20 65 212 383
## Frequency 1 1 1 1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## san_franciso
## n missing distinct Info Mean Gmd
## 4 0 4 1 263.5 232.3
##
## Value 102 129 320 503
## Frequency 1 1 1 1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## seattle
## n missing distinct Info Mean Gmd
## 4 0 4 1 602 907.3
##
## Value 61 201 305 1841
## Frequency 1 1 1 1
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
Store our cleaned data into a csv file
write.csv(AirlinesData_df, "./cleanAirlines.csv", row.names=FALSE)
#store our data into a temp dataframe
tempAirlines <- AirlinesData_df
tempAirlines
## airline status los_angeles phoenix san_diego san_franciso seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Analyze this!!
#gathering these columns
names(tempAirlines)[3:7]
## [1] "los_angeles" "phoenix" "san_diego" "san_franciso"
## [5] "seattle"
#Apply and view output
airLines_analysis1 <- gather(tempAirlines, "city", "num", names(tempAirlines)[3:7], factor_key=TRUE)
airLines_analysis1
## airline status city num
## 1 ALASKA on time los_angeles 497
## 2 ALASKA delayed los_angeles 62
## 3 AM WEST on time los_angeles 694
## 4 AM WEST delayed los_angeles 117
## 5 ALASKA on time phoenix 221
## 6 ALASKA delayed phoenix 12
## 7 AM WEST on time phoenix 4840
## 8 AM WEST delayed phoenix 415
## 9 ALASKA on time san_diego 212
## 10 ALASKA delayed san_diego 20
## 11 AM WEST on time san_diego 383
## 12 AM WEST delayed san_diego 65
## 13 ALASKA on time san_franciso 503
## 14 ALASKA delayed san_franciso 102
## 15 AM WEST on time san_franciso 320
## 16 AM WEST delayed san_franciso 129
## 17 ALASKA on time seattle 1841
## 18 ALASKA delayed seattle 305
## 19 AM WEST on time seattle 201
## 20 AM WEST delayed seattle 61
airLines_analysis1 <- airLines_analysis1 %>%
spread(status, `num`)
airLines_analysis1
## airline city delayed on time
## 1 ALASKA los_angeles 62 497
## 2 ALASKA phoenix 12 221
## 3 ALASKA san_diego 20 212
## 4 ALASKA san_franciso 102 503
## 5 ALASKA seattle 305 1841
## 6 AM WEST los_angeles 117 694
## 7 AM WEST phoenix 415 4840
## 8 AM WEST san_diego 65 383
## 9 AM WEST san_franciso 129 320
## 10 AM WEST seattle 61 201
airLines_analysis1 <- airLines_analysis1 %>%
mutate(total = delayed + `on time`, percent_delayed = delayed/ total, percent_ontime = `on time`/ total)
airLines_analysis1
## airline city delayed on time total percent_delayed
## 1 ALASKA los_angeles 62 497 559 0.11091234
## 2 ALASKA phoenix 12 221 233 0.05150215
## 3 ALASKA san_diego 20 212 232 0.08620690
## 4 ALASKA san_franciso 102 503 605 0.16859504
## 5 ALASKA seattle 305 1841 2146 0.14212488
## 6 AM WEST los_angeles 117 694 811 0.14426634
## 7 AM WEST phoenix 415 4840 5255 0.07897241
## 8 AM WEST san_diego 65 383 448 0.14508929
## 9 AM WEST san_franciso 129 320 449 0.28730512
## 10 AM WEST seattle 61 201 262 0.23282443
## percent_ontime
## 1 0.8890877
## 2 0.9484979
## 3 0.9137931
## 4 0.8314050
## 5 0.8578751
## 6 0.8557337
## 7 0.9210276
## 8 0.8549107
## 9 0.7126949
## 10 0.7671756
ggplot(data=airLines_analysis1, aes(x=airline, y=percent_ontime)) +
geom_bar(stat="identity", width=0.5, fill="#52823f") +
xlab("Airlines") + ylab("On Time") +
ggtitle("On Time")
ggplot(data=airLines_analysis1, aes(x=airline, y=percent_delayed)) +
geom_bar(stat="identity", width=0.5, fill="#82423f") +
xlab("Airlines") + ylab("Delays") +
ggtitle("Delays")
we observe that AM West has more delays and less on time flights than ALASKA.
ggplot(data=airLines_analysis1, aes(x = airline, y=percent_ontime, fill = city)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("On Time") + ggtitle("On Time per Airport")
ggplot(data=airLines_analysis1, aes(x = airline, y=percent_delayed, fill = city)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("Delays") + ggtitle("On Time per Airport")
When it comes to major airports, there are more delays there. Might be in these major airports, AM WEST conducts more flights than in ALASKA. Let’s check our assumption graphically, shall we?
#Apply and view output
airLines_analysis2 <- gather(tempAirlines, "city", "num", names(tempAirlines)[3:7], factor_key=TRUE)
airLines_analysis2
## airline status city num
## 1 ALASKA on time los_angeles 497
## 2 ALASKA delayed los_angeles 62
## 3 AM WEST on time los_angeles 694
## 4 AM WEST delayed los_angeles 117
## 5 ALASKA on time phoenix 221
## 6 ALASKA delayed phoenix 12
## 7 AM WEST on time phoenix 4840
## 8 AM WEST delayed phoenix 415
## 9 ALASKA on time san_diego 212
## 10 ALASKA delayed san_diego 20
## 11 AM WEST on time san_diego 383
## 12 AM WEST delayed san_diego 65
## 13 ALASKA on time san_franciso 503
## 14 ALASKA delayed san_franciso 102
## 15 AM WEST on time san_franciso 320
## 16 AM WEST delayed san_franciso 129
## 17 ALASKA on time seattle 1841
## 18 ALASKA delayed seattle 305
## 19 AM WEST on time seattle 201
## 20 AM WEST delayed seattle 61
ggplot(data=airLines_analysis2, aes(x = airline, y=num, fill = city)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("Number of Flights") + ggtitle("Number of Flights per Airport")
We can see clearly that AM West is running few flights in the major airport, but has more delays than ALASKA. So our previous assumption was totally wrong.