Read csv and transform to have shape nx4. The dataset can be found here
df <- read.csv('https://raw.githubusercontent.com/ksooklall/CUNY-SPS-Masters-DS/main/DATA_607/homework/homework5/flightdata.csv', sep=',')
colnames(df) <- c('airline', 'status', 'los_angeles', 'phoenix', 'san_diego', 'san_francisco', 'seattle')
df <- df %>% mutate_all(list(~na_if(.,''))) %>% fill(airline)
df <- df %>% pivot_longer(!c('airline', 'status'), names_to='location', values_to='count')
df %>% group_by(airline, status) %>% summarise(total=sum(count), .groups='drop') %>% ggplot(aes(x=airline, y=total, color=status)) + geom_point(aes(size=total))
It looks like AM WEST has more on time flights.
wordcloud(words=df$location, freq=df$count, color = 'blue', size=1, shape="rectangle", backgroundColor="white")
From the word cloud Phoenix is by far the most popular destination.
df %>% group_by(location, status) %>% summarise(total=sum(count), .groups='drop') %>% ggplot(aes(x=location, y=total, fill=status)) + geom_col()
From the bar plot it can be seen that San Diego has the least delays.
df[,c('airline', 'count')] %>% ggplot(aes(x=airline, y=count)) + geom_boxplot()
Both airlines share a close median however AM WEST has a large outliar while ALASKA has a larger IQR range