For week 6 we are given a dataset for two airlines and their on-time and delayed numbers for five cities. The dataset is in wide table format and we will have to use dplyr and tidyr to analyze the dataset and do a comparision between the two airlines AM WEST and ALASKA.
My Figure
To do the analysis, I first added the information in the figure above to a CSV file, which can be downloaded from : [link] https://github.com/ravi-kothari/DATA-607/blob/master/Assignment6.csv
Also we will need the following package to do the analysis:
dplyr
tidyr
Downloader
ggplot2(optional)
install.packages("tidyr")
install.packages("dplyr")
library("tidyr")
library("dplyr")
library("Downloader")
We created a dataframe df and loaded the dataset from the csv file. Assuming you downloaded the csv file and added it to the current working directory.
getwd()
## [1] "C:/Users/rkothari/Documents/MSDA/DATA 607"
# load dataset into a dataframe.
df <- read.csv("~/MSDA/DATA 607/Assignment6.csv",header=TRUE,stringsAsFactors = FALSE)
df
## X X.1 Los.Angeles Phoenix San.Diego San.Franciso Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
We will convert the wide data format to long format first using function gather() in tidyr package.
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df #wide format
## X X.1 Los.Angeles Phoenix San.Diego San.Franciso Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
data_long <- gather(df, city, number, Los.Angeles:Seattle, factor_key=TRUE)
# rename column name.
colnames(data_long)[2] <- "arrival_delay"
colnames(data_long)[1] <- "Airline"
data_long$Airline <- factor(data_long$Airline)
data_long
## Airline arrival_delay city number
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Franciso 503
## 14 ALASKA delayed San.Franciso 102
## 15 AM WEST on time San.Franciso 320
## 16 AM WEST delayed San.Franciso 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
# The arguments to spread():
# - data: Data object
# - key: Name of column containing the new column names
# - value: Name of column containing values
clean_data <- spread(data_long, "arrival_delay", "number")
is.factor(clean_data$Airline)
## [1] TRUE
is.numeric(clean_data$delayed)
## [1] TRUE
is.numeric(clean_data$`on time`)
## [1] TRUE
is.factor(clean_data$city)
## [1] TRUE
clean_data #final tidy data to do the analysis using dplyr package.
## Airline city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Franciso 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Franciso 129 320
## 10 AM WEST Seattle 61 201
spread() returns a copy of your data set that has had the key and value columns removed. In their place, spread() adds a new column for each unique value of the key column. These unique values will form the column names of the new columns. spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
After reshaping the data we used dplyr’s functions to aggregate the values and calculated the percent of on time numbers then summarized them to find which airlines performed better.
# Using pipe operator of dplyr to calculate the below summary
analysis <- as.data.frame(clean_data %>%
mutate(total = delayed + `on time`) %>%
group_by(Airline) %>%
summarise(delayed_total = sum(delayed),
on_time_total = sum(`on time`),
count_total = sum(total),
percent_ontime = round(on_time_total/count_total*100,digit = 2)))
analysis
## Airline delayed_total on_time_total count_total percent_ontime
## 1 ALASKA 501 3274 3775 86.73
## 2 AM WEST 787 6438 7225 89.11
So from the analysis we can see that AM WEST airlines has better percentage of on time arrival than Alaskan airlines. We can further measure the performance of the airlines by summarizing the data based on the cities.