library(tidyverse)
library(RCurl)
url<-"https://raw.githubusercontent.com/JackJosephWright/data/master/hw5.csv"
dat<-read.csv(url)
lets take a look, I will note in the comments what actions I will take to tidy the data.
#rename a couple columns for ease of use
names(dat)[1]<-"carrier"
names(dat)[2]<-"status"
#remove blank row
dat1<-dat%>%
rowwise()%>%
filter(!is.na(Los.Angeles))
#fill carrier column, fill() is not working I will dig deeper if i have time
dat1$carrier[2]<-"ALASKA"
dat1$carrier[4]<-"AM WEST"
In order for data to be tidy, every column must be a variable and every row must be an observation. On first glance, the destinations look like values of a “destination” column and their values look like “count”.
the “X column looks like values of a”carrier" variable
#converting "destinations" to destination column and their values to "count"
carrier_status<-dat1%>%
pivot_longer(c("Los.Angeles":"Seattle"),names_to="destination",values_to="count")%>%
arrange(destination)
head(carrier_status)
## # A tibble: 6 x 4
## carrier status destination count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
I want to see which carrier supplied more on time flights.
I will turn the counts into percentages, only display the on time results, because the delayed is redundant, and drop the “status and”count" columns.
carrier_stats<-carrier_status%>%
group_by(carrier,destination)%>%
mutate(on_time_rate=count/sum(count))%>%
filter(status=="on time")%>%
select(-status, -count)
head(carrier_stats)
## # A tibble: 6 x 3
## # Groups: carrier, destination [6]
## carrier destination on_time_rate
## <chr> <chr> <dbl>
## 1 ALASKA Los.Angeles 0.889
## 2 AM WEST Los.Angeles 0.856
## 3 ALASKA Phoenix 0.948
## 4 AM WEST Phoenix 0.921
## 5 ALASKA San.Diego 0.914
## 6 AM WEST San.Diego 0.855
I want to see how each carrier did flying to each city. I will make a “dodge” bar plot allowing easy comparison.
carrier_stats%>%
ggplot(aes(fill=carrier,y=on_time_rate,x=destination))+
geom_bar(position="dodge",stat="identity")+coord_flip()
From this plot, it is easy to see Alaska Airlines is better across the board. You can also see that Phoenix has the most on time flights. Further analysis could be done on other datasets to see why Phoenix is so effective.