KSooklall_Homework5 DATA-607

Read in the data

Read csv and transform to have shape nx4. The dataset can be found here

Airline - ALASKA or AM WEST
Status - On time or Delayed
Location - Destination of flight
Count - The number of flights to a certain location

df <- read.csv('https://raw.githubusercontent.com/ksooklall/CUNY-SPS-Masters-DS/main/DATA_607/homework/homework5/flightdata.csv', sep=',')
colnames(df) <- c('airline', 'status', 'los_angeles', 'phoenix', 'san_diego', 'san_francisco', 'seattle')
df <- df %>% mutate_all(list(~na_if(.,''))) %>% fill(airline)
df <- df %>% pivot_longer(!c('airline', 'status'), names_to='location', values_to='count')

Which airline is more dependable, ie which one was on time more?

df %>% group_by(airline, status) %>% summarise(total=sum(count), .groups='drop') %>% ggplot(aes(x=airline, y=total, color=status)) + geom_point(aes(size=total))

It looks like AM WEST has more on time flights.

Which location was most popular?

wordcloud(words=df$location, freq=df$count, color = 'blue', size=1, shape="rectangle", backgroundColor="white")

From the word cloud Phoenix is by far the most popular destination.

df %>% group_by(location, status) %>% summarise(total=sum(count), .groups='drop') %>% ggplot(aes(x=location, y=total, fill=status)) + geom_col()

From the bar plot it can be seen that San Diego has the least delays.

How do the two airlines compare?

df[,c('airline', 'count')] %>% ggplot(aes(x=airline, y=count)) + geom_boxplot()

Both airlines share a close median however AM WEST has a large outliar while ALASKA has a larger IQR range

KSooklall_Homework5 DATA-607

Kenan Sooklall

3/2/2021

Read in the data

Which airline is more dependable, ie which one was on time more?

Which location was most popular?

How do the two airlines compare?