In this assignment, we will import a .csv table that contains untidy data and transform the table to confirm to tidy concepts. We will then answer some statistical questions regarding the variables in the table to compare the arrival delays for the two airlines.
The following packages are used in the assignment and we are assuming that they are available;
RCurl,
zoo,
tidyr,
dplyr,
ggplot2
## Loading required package: bitops
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
For reproducible results, the .csv file (CUNY 607 - Tidy Data Assignment.csv) has been loaded on Github. We will download from GitHub to start the assignment. The file can be found at the following link:
[https://github.com/vbriot28/datascienceCUNY_607/blob/master/CUNY%20607%20-%20Tidy%20Data%20Assignment.csv]
First we will download the file and store it into temp_table. Prior to transforming the table we will view it.
file_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607%20-%20Tidy%20Data%20Assignment.csv")
temp_table <- read.csv(text = file_url)
# View file using "tbl_df" command from dplyr package
tbl_df(temp_table)
## Source: local data frame [5 x 7]
##
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## (fctr) (fctr) (dbl) (fctr) (dbl) (dbl) (fctr)
## 1 ALASKA On-Time 497 221 212 503 1,841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM WEST On-Time 694 4,840 383 320 201
## 5 Delayed 117 415 65 129 61
From this observation, we determine that we will need to modify the content/structure of table as follows:
1. We would need to introduce a name for column 1; Airlines
Since there was no header for this column in the original .csv file, system defaulted name to X, we will rename this column
We would need to introduce a name for column 2; Timetype
Again, there was no header for this column in the original .csv file, the system defaulted to X.1, we will rename this column
The 3rd line of the table need to be removed
This line was a separator in the original table and do not contain any meaningful data and should be removed, we will do this by removing any row with NA entries. We are assuming that such rows were delimiters in original table.
We need to fill-in the appropriate value for airline in the missing row of the table
For each airline, the data is information is written over 2 rows, the first one pertain to on-time flights, the second one to delayed flights. We must extend the airline name from one row to the next one.
# Item# 4, remove any rows with NA as an entry, we will store result in a different data frame temp_table2
temp_table2 <- na.omit(temp_table)
# Item# 5, we need to fill-in column 1 column when "blank" with value from prior row, to do so we will leverage na.locf() from zoo package
temp_table2[temp_table2 == ""]<-NA
temp_table2$X <- na.locf(temp_table2$X)
# Item# 1 & 2, rename column 1 & 2 of data frame with "Airlines" and "TimeType"
names(temp_table2)[1:2] <- c("Airlines", "Timetype")
# Item# 3, column 4 & 7 must be converted from factors to numeric (double), we will first strip the "," and them convert to double to match other columns
# please note that using gsub convert factor to character so we only then have to convert to numeric.
temp_table2$Phoenix <- as.double(gsub(",", "", temp_table2$Phoenix))
temp_table2$Seattle <- as.double(gsub(",", "", temp_table2$Seattle))
# We will re-display the table to confirm all the transformations
tbl_df(temp_table2)
## Source: local data frame [4 x 7]
##
## Airlines Timetype Los.Angeles Phoenix San.Diego San.Francisco Seattle
## (fctr) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 ALASKA On-Time 497 221 212 503 1841
## 2 ALASKA Delayed 62 12 20 102 305
## 3 AM WEST On-Time 694 4840 383 320 201
## 4 AM WEST Delayed 117 415 65 129 61
We will now proceed with “tidying” the table. In order for a table to be tidy, the following must hold true:
1. Each Variable is saved in its own column
2. Each observation is save in its own row
3. Each type of observation is save in its own table
Looking at the table, it is apparent that this table is in a “wide format”, with cities values being used as columns, we will therefore gather these columns under heading of “cities” and introduce a column “Time”. In order to perform this operation, we will leverage gather() function from tidyr package.
We will store the result in a new table; temp_table3. We will view the resulting table after the transformation.
# gather columns 3:7 into single column under heading of "cities" and introduce a column "t_count"
temp_table3 <-gather(temp_table2, cities, t_count, 3:7)
# we will view the resulting table
tbl_df(temp_table3)
## Source: local data frame [20 x 4]
##
## Airlines Timetype cities t_count
## (fctr) (fctr) (chr) (dbl)
## 1 ALASKA On-Time Los.Angeles 497
## 2 ALASKA Delayed Los.Angeles 62
## 3 AM WEST On-Time Los.Angeles 694
## 4 AM WEST Delayed Los.Angeles 117
## 5 ALASKA On-Time Phoenix 221
## 6 ALASKA Delayed Phoenix 12
## 7 AM WEST On-Time Phoenix 4840
## 8 AM WEST Delayed Phoenix 415
## 9 ALASKA On-Time San.Diego 212
## 10 ALASKA Delayed San.Diego 20
## 11 AM WEST On-Time San.Diego 383
## 12 AM WEST Delayed San.Diego 65
## 13 ALASKA On-Time San.Francisco 503
## 14 ALASKA Delayed San.Francisco 102
## 15 AM WEST On-Time San.Francisco 320
## 16 AM WEST Delayed San.Francisco 129
## 17 ALASKA On-Time Seattle 1841
## 18 ALASKA Delayed Seattle 305
## 19 AM WEST On-Time Seattle 201
## 20 AM WEST Delayed Seattle 61
This table is actually in a tidy format. However, for the analysis we have in mind, mainly comparing the arrival delayed times for the two airlines, it would be better to have the On-Time time and Delayed time on 2 separate columns. In order to achieve this result we would need to “spread” the table. This can be acomplished with the spread() function from the tidyr package. Again we will store the result in a separate table; temp_table4. We will view the resulting table after the transformations.
# spread column Timetype into values On-Time and Delayed using column "t_count" for input into cell
temp_table4 <- spread(temp_table3, Timetype, t_count)
# we will view the resulting table
tbl_df(temp_table4)
## Source: local data frame [10 x 4]
##
## Airlines cities Delayed On-Time
## (fctr) (chr) (dbl) (dbl)
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
First we will generate descriptive statistics for the table using the summary function in base R.
summary(temp_table4)
## Airlines cities Delayed On-Time
## :0 Length:10 Min. : 12.00 Min. : 201.0
## ALASKA :5 Class :character 1st Qu.: 61.25 1st Qu.: 245.8
## AM WEST:5 Mode :character Median : 83.50 Median : 440.0
## Mean :128.80 Mean : 971.2
## 3rd Qu.:126.00 3rd Qu.: 646.2
## Max. :415.00 Max. :4840.0
We are interested in the arrival delays for the two airlines. We would first calculate a ratio for each airline, for each city that would correspond to percentage of delays (delays/total arrival, i.e. delays/(delays+on-time)). This calculation will be done using the mutate function in dplyr package. We will store the resulting table in temp_table5. We will review the resulting table after the calculation.
temp_table5 <- mutate(temp_table4, delays_ratio = round(temp_table4$Delayed/(temp_table4$"On-Time"+temp_table4$Delayed), 2))
tbl_df(temp_table5)
## Source: local data frame [10 x 5]
##
## Airlines cities Delayed On-Time delays_ratio
## (fctr) (chr) (dbl) (dbl) (dbl)
## 1 ALASKA Los.Angeles 62 497 0.11
## 2 ALASKA Phoenix 12 221 0.05
## 3 ALASKA San.Diego 20 212 0.09
## 4 ALASKA San.Francisco 102 503 0.17
## 5 ALASKA Seattle 305 1841 0.14
## 6 AM WEST Los.Angeles 117 694 0.14
## 7 AM WEST Phoenix 415 4840 0.08
## 8 AM WEST San.Diego 65 383 0.15
## 9 AM WEST San.Francisco 129 320 0.29
## 10 AM WEST Seattle 61 201 0.23
We will also calculate the mean of ratio for each airlines:
temp_table5 %>% group_by (Airlines) %>%
summarise(mean = mean(delays_ratio))
## Source: local data frame [2 x 2]
##
## Airlines mean
## (fctr) (dbl)
## 1 ALASKA 0.112
## 2 AM WEST 0.178
From this result, we can conclude that AM WEST airline has a higher average percentage of arrivals that are delayed thean ALASKA airline.
We will now graph the ratio of delayed arrivals for each airlines, per city. We will use colors to differentiate one airline from the other. The graphs will be done with ggplot2 package. For graphic function to work correctly, we must convert cities to factor from character.
# Convert cities to factor
temp_table5$cities <- as.factor(temp_table5$cities)
# ungroup the table
temp_table5 %>% ungroup()
## Airlines cities Delayed On-Time delays_ratio
## 1 ALASKA Los.Angeles 62 497 0.11
## 2 ALASKA Phoenix 12 221 0.05
## 3 ALASKA San.Diego 20 212 0.09
## 4 ALASKA San.Francisco 102 503 0.17
## 5 ALASKA Seattle 305 1841 0.14
## 6 AM WEST Los.Angeles 117 694 0.14
## 7 AM WEST Phoenix 415 4840 0.08
## 8 AM WEST San.Diego 65 383 0.15
## 9 AM WEST San.Francisco 129 320 0.29
## 10 AM WEST Seattle 61 201 0.23
# Display cleveland plot
ggplot(temp_table5, aes(x=delays_ratio, y=reorder(cities, delays_ratio), colour = Airlines)) + geom_point(size=3) + ggtitle("Cleveland Dot Plot for % Delayed Arrivals per Cities") + labs(x="% Delayed Arrivals", y="Arrival Cities")
# Display line plot
ggplot(temp_table5, aes(x=reorder(cities, delays_ratio), y=delays_ratio, colour = Airlines, group = Airlines)) + geom_line(linetype = "dashed") + geom_point(shape = 22, size = 3, fill = "white") + ggtitle("Line Graph for % Delayed Arrivals per Cities") + labs(x="Arrival Cities", y="% Delayed Arrivals")
From both these graphs, % Delayed Arrivals is much greater fro AM WEST airlines for each observations. We also noticed that busy airports like San Fracisco, Seattle, and Los Angeles experienced more delayed than San Diego and Phoenix. However, we out additional information it would be difficult to determine why. We should noticed that for AM WEST, the % delayed arrivals for Los Angeles much less than what might be expected judging by the graphs pattern. Further investigation would be required.