Tidying and Transforming Data

Assignment Description:

For week 6 we are given a dataset for two airlines and their on-time and delayed numbers for five cities. The dataset is in wide table format and we will have to use dplyr and tidyr to analyze the dataset and do a comparision between the two airlines AM WEST and ALASKA.

My Figure

My Figure

Setup

To do the analysis, I first added the information in the figure above to a CSV file, which can be downloaded from : [link] https://github.com/ravi-kothari/DATA-607/blob/master/Assignment6.csv

Also we will need the following package to do the analysis:

  • dplyr

  • tidyr

  • Downloader

  • ggplot2(optional)

install.packages("tidyr")
install.packages("dplyr")
library("tidyr")
library("dplyr")
library("Downloader")

Solution:

Step 1: Loading the dataset.

We created a dataframe df and loaded the dataset from the csv file. Assuming you downloaded the csv file and added it to the current working directory.

getwd()
## [1] "C:/Users/rkothari/Documents/MSDA/DATA 607"
# load dataset into a dataframe.

df <- read.csv("~/MSDA/DATA 607/Assignment6.csv",header=TRUE,stringsAsFactors = FALSE)
df
##         X     X.1 Los.Angeles Phoenix San.Diego San.Franciso Seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2  ALASKA delayed          62      12        20          102     305
## 3 AM WEST on time         694    4840       383          320     201
## 4 AM WEST delayed         117     415        65          129      61

Step 2: Tidy and Transform data using tidyr and dplyr:

We will convert the wide data format to long format first using function gather() in tidyr package.

# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)

library("tidyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df #wide format
##         X     X.1 Los.Angeles Phoenix San.Diego San.Franciso Seattle
## 1  ALASKA on time         497     221       212          503    1841
## 2  ALASKA delayed          62      12        20          102     305
## 3 AM WEST on time         694    4840       383          320     201
## 4 AM WEST delayed         117     415        65          129      61
data_long <- gather(df, city, number, Los.Angeles:Seattle, factor_key=TRUE)

# rename column name.
colnames(data_long)[2] <- "arrival_delay" 
colnames(data_long)[1] <- "Airline"
data_long$Airline <- factor(data_long$Airline)
data_long
##    Airline arrival_delay         city number
## 1   ALASKA       on time  Los.Angeles    497
## 2   ALASKA       delayed  Los.Angeles     62
## 3  AM WEST       on time  Los.Angeles    694
## 4  AM WEST       delayed  Los.Angeles    117
## 5   ALASKA       on time      Phoenix    221
## 6   ALASKA       delayed      Phoenix     12
## 7  AM WEST       on time      Phoenix   4840
## 8  AM WEST       delayed      Phoenix    415
## 9   ALASKA       on time    San.Diego    212
## 10  ALASKA       delayed    San.Diego     20
## 11 AM WEST       on time    San.Diego    383
## 12 AM WEST       delayed    San.Diego     65
## 13  ALASKA       on time San.Franciso    503
## 14  ALASKA       delayed San.Franciso    102
## 15 AM WEST       on time San.Franciso    320
## 16 AM WEST       delayed San.Franciso    129
## 17  ALASKA       on time      Seattle   1841
## 18  ALASKA       delayed      Seattle    305
## 19 AM WEST       on time      Seattle    201
## 20 AM WEST       delayed      Seattle     61
# The arguments to spread():
# - data: Data object
# - key: Name of column containing the new column names
# - value: Name of column containing values

clean_data <- spread(data_long, "arrival_delay", "number") 

is.factor(clean_data$Airline)
## [1] TRUE
is.numeric(clean_data$delayed)
## [1] TRUE
is.numeric(clean_data$`on time`)
## [1] TRUE
is.factor(clean_data$city)
## [1] TRUE
clean_data #final tidy data to do the analysis using dplyr package.
##    Airline         city delayed on time
## 1   ALASKA  Los.Angeles      62     497
## 2   ALASKA      Phoenix      12     221
## 3   ALASKA    San.Diego      20     212
## 4   ALASKA San.Franciso     102     503
## 5   ALASKA      Seattle     305    1841
## 6  AM WEST  Los.Angeles     117     694
## 7  AM WEST      Phoenix     415    4840
## 8  AM WEST    San.Diego      65     383
## 9  AM WEST San.Franciso     129     320
## 10 AM WEST      Seattle      61     201

spread() returns a copy of your data set that has had the key and value columns removed. In their place, spread() adds a new column for each unique value of the key column. These unique values will form the column names of the new columns. spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

Step 3: Data Analysis

After reshaping the data we used dplyr’s functions to aggregate the values and calculated the percent of on time numbers then summarized them to find which airlines performed better.

# Using pipe operator of dplyr to calculate the below summary
analysis <- as.data.frame(clean_data %>%
  mutate(total = delayed + `on time`) %>%
  group_by(Airline) %>%
  summarise(delayed_total = sum(delayed), 
            on_time_total = sum(`on time`),
            count_total = sum(total),
            percent_ontime = round(on_time_total/count_total*100,digit = 2)))
analysis
##   Airline delayed_total on_time_total count_total percent_ontime
## 1  ALASKA           501          3274        3775          86.73
## 2 AM WEST           787          6438        7225          89.11

So from the analysis we can see that AM WEST airlines has better percentage of on time arrival than Alaskan airlines. We can further measure the performance of the airlines by summarizing the data based on the cities.