Let’s read csv file
adata<-read.csv(file="C://Data/AirlinesData.csv")
adata
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Let’s get rid of a blank row and properly name first 2 columns
Let’s load libraries tidyr and dplyr and then run fill function to populate missing values
adata<-subset(adata,!is.na(Los.Angeles))
names(adata)[1:2]<-c("Airline","Status_flight")
# install.packages(c("tidyr","dplyr"))
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
adata$Airline[adata$Airline==""] <- NA
adata1<-adata %>% fill(Airline)
Now, we can transpose the dataframe. We could also reformat new column “Destination” to replace “.” with " “. We will aslo spread Status_flight column
adata2 <- adata1 %>% gather(Destination, Num_flights, -c("Airline","Status_flight"))
adata2$Destination <- gsub("\\.", " ", adata2$Destination)
adata3 <- adata2 %>% spread(Status_flight, Num_flights)
adata3
## Airline Destination delayed on time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Let’s rename column “on time” and resort our data.
names(adata3)[4]<-c("on_time")
adata3
## Airline Destination delayed on_time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
adata4<-adata3 %>% arrange(Airline, Destination)
adata4
## Airline Destination delayed on_time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Let’s perform our analysis
adata4$ratio<-round(adata4$delayed/(adata4$delayed+adata4$on_time),digits=2)
adata4
## Airline Destination delayed on_time ratio
## 1 ALASKA Los Angeles 62 497 0.11
## 2 ALASKA Phoenix 12 221 0.05
## 3 ALASKA San Diego 20 212 0.09
## 4 ALASKA San Francisco 102 503 0.17
## 5 ALASKA Seattle 305 1841 0.14
## 6 AM WEST Los Angeles 117 694 0.14
## 7 AM WEST Phoenix 415 4840 0.08
## 8 AM WEST San Diego 65 383 0.15
## 9 AM WEST San Francisco 129 320 0.29
## 10 AM WEST Seattle 61 201 0.23
a_sum <- adata4 %>% group_by(Airline) %>% summarize(sumd=sum(delayed),sumon=sum(on_time))
a_sum
## # A tibble: 2 x 3
## Airline sumd sumon
## <fct> <int> <int>
## 1 ALASKA 501 3274
## 2 AM WEST 787 6438
a_sum$ratio <- round(a_sum$sumd/(a_sum$sumon+a_sum$sumd),digits=2)
a_sum
## # A tibble: 2 x 4
## Airline sumd sumon ratio
## <fct> <int> <int> <dbl>
## 1 ALASKA 501 3274 0.13
## 2 AM WEST 787 6438 0.11
Conclusion - Alaska company has more delayed flights than AM West