The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission: ??? The URL to the .Rmd file in your GitHub repository. and ??? The URL for your rpubs.com web page.
library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Reading the csv file:
airlines <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607_Week5_Assignment_Tidying_Transforming_Data/master/Numbersense.csv", header=FALSE, sep=",", stringsAsFactors=FALSE)
airlines
## V1 V2 V3 V4 V5 V6 V7
## 1 Los Angeles Phoenix San Diego San Fransisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
Data cleaning:
Renaming the columns:
airlines$V1[1] <- "airline"
airlines$V2[1] <- "status"
airlines
## V1 V2 V3 V4 V5 V6 V7
## 1 airline status Los Angeles Phoenix San Diego San Fransisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
Making the row #1 as column name:
names(airlines) <- airlines[1,]
airlines
## airline status Los Angeles Phoenix San Diego San Fransisco Seattle
## 1 airline status Los Angeles Phoenix San Diego San Fransisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
Delete row #1 and 4:
airlines <- airlines[-c(1,4), ]
airlines
## airline status Los Angeles Phoenix San Diego San Fransisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
Addind Airline names to the missing records:
airlines$airline[2] = 'ALASKA'
airlines$airline[4] = 'AM WEST'
Transforming the table by using gather():
Gather: takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.
airlines <-gather(airlines, city, number, 3:7, factor_key=TRUE)
airlines
## airline status city number
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San Diego 212
## 10 ALASKA delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 ALASKA on time San Fransisco 503
## 14 ALASKA delayed San Fransisco 102
## 15 AM WEST on time San Fransisco 320
## 16 AM WEST delayed San Fransisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Using spread() function to separate the data on Status:
Spread: Spread a key-value pair across multiple columns.
airlines <- airlines %>%
spread(status, `number`)
airlines
## airline city delayed on time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Fransisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Fransisco 129 320
## 10 AM WEST Seattle 61 201
for (i in 3:dim(airlines)[2]){
airlines[,i] <- as.integer(airlines[,i])
}
airlines
## airline city delayed on time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Fransisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Fransisco 129 320
## 10 AM WEST Seattle 61 201
Using mutate for adding new columns like total flights, % on time and % delays:
Mutate: Mutate adds new variables and preserves existing
airlines <- airlines %>%
mutate(total = delayed + `on time`)
airlines
## airline city delayed on time total
## 1 ALASKA Los Angeles 62 497 559
## 2 ALASKA Phoenix 12 221 233
## 3 ALASKA San Diego 20 212 232
## 4 ALASKA San Fransisco 102 503 605
## 5 ALASKA Seattle 305 1841 2146
## 6 AM WEST Los Angeles 117 694 811
## 7 AM WEST Phoenix 415 4840 5255
## 8 AM WEST San Diego 65 383 448
## 9 AM WEST San Fransisco 129 320 449
## 10 AM WEST Seattle 61 201 262
airlines <- airlines %>%
mutate(on_time_percentage = (`on time`/total)*100,
delayed_percentage = (delayed/total)*100) %>%
arrange(city)
airlines
## airline city delayed on time total on_time_percentage
## 1 ALASKA Los Angeles 62 497 559 88.90877
## 2 AM WEST Los Angeles 117 694 811 85.57337
## 3 ALASKA Phoenix 12 221 233 94.84979
## 4 AM WEST Phoenix 415 4840 5255 92.10276
## 5 ALASKA San Diego 20 212 232 91.37931
## 6 AM WEST San Diego 65 383 448 85.49107
## 7 ALASKA San Fransisco 102 503 605 83.14050
## 8 AM WEST San Fransisco 129 320 449 71.26949
## 9 ALASKA Seattle 305 1841 2146 85.78751
## 10 AM WEST Seattle 61 201 262 76.71756
## delayed_percentage
## 1 11.091234
## 2 14.426634
## 3 5.150215
## 4 7.897241
## 5 8.620690
## 6 14.508929
## 7 16.859504
## 8 28.730512
## 9 14.212488
## 10 23.282443
Using ggplot to understand more:
ggplot(airlines, aes(x = airline, y=delayed_percentage, fill = city)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("Delays")
To calculate the average:
airlines_summary = airlines %>% group_by(airline) %>% summarise(total_delayed = sum(delayed), total_ontime = sum(`total`), avg_delay = total_delayed/total_ontime*100)
airlines_summary
## # A tibble: 2 x 4
## airline total_delayed total_ontime avg_delay
## <chr> <int> <int> <dbl>
## 1 ALASKA 501 3775 13.3
## 2 AM WEST 787 7225 10.9
Calculate the city totals:
airlines %>%
group_by(city) %>%
summarize(count = n(), sum_city = sum(total))
## # A tibble: 5 x 3
## city count sum_city
## <fct> <int> <int>
## 1 Los Angeles 2 1370
## 2 Phoenix 2 5488
## 3 San Diego 2 680
## 4 San Fransisco 2 1054
## 5 Seattle 2 2408
Boxplot to check the outlier:
ggplot(airlines, aes(airline, total)) + geom_boxplot()