Comparing Flight Data and Social Media Sentiment

This intent of this report to compare metrics for domestic flights originating in the United States on February 24, 2015 with similar flights originating on November 23, 2016. The fight data was obtained from the United States Department of Transportation Website. Unfortunately, the November 2016 flight data will not be available until January 2017 so the assumption was made that flight data for the day before the Thanksgiving holiday in 2015 could be substituted for the 2016 flight data until the November 2016 data becomes available.

The flight data will be compared to social media activity on Twitter occurring on February 24, 2015 and November 23, 2016 so the correlation between the November 2015 flight data and the November 2016 social media data is for demonstration purposes only until the November 2016 flight data becomes available.

On December 9, 2013, American Airlines and US Airways merged to form American Airlines Group. Each Airline continues to operate independently through a planned integration period. A data set containing social media activity regarding domestic air travel in the United States on February 24,2015 was Obtained from Kaggle and is used to conduct a sentiment analysis of airline service during planned integration time period. The assumption was made that the combination of American Airlines and Us Airways data can be directly compared to the post merge data.

On October 17, 2015, the American and US Airways merger was finalized and the integration process was considered complete. This supports the assumption of being able to use the November 2015 flight data in place of the November 2016 flight data because American Airlines and US Airways were operating as one single airline under the American Airlines name.

To determine of the busiest airports in the United States in 2015, Wikipedia was sourced to scrape the List of the busiest airports in the United States. The Selector Gadget extension for Google Chrome was utilized to select the CSS data from the Wikipedia page.

To obtain latitude and longitude information, all airport codes and their location were downloaded from United States Department of Transportation

The most recent twitter data was obtained on November 24, 2016 and saved in .csv format to preserve the data. The code used to download the social media data from Twitter can be found in a separate report published to RPubs here: airline_tweets.html.

Note: If you have trouble with any of the links, try right clicking and selecting “open in a new window (or tab)”. You may also need to download the “airline_tweets.html” file and open it with a browser to view it properly.

Data Files

Note: The link to the data works fine in the knit .html version but does not open in RPubs.

To view the data in RPubs: right-click the link and select “open link in a new tab” or download the files individually and open them from your PC with the appropriate software.

Airline_Tweets_24FEB2015.csv

Airline_Tweets_24NOV2016.csv

Flights_NOV2015.csv

Flights_SEP2016.csv

airport_location.csv

Loading The Raw Data

#Load the Raw Flight Data from February 2015
RAW_FEB2015 <- read.csv("Flights_FEB2015.csv", header = TRUE, stringsAsFactors = FALSE)

#Load the Raw Flight Data from November 2015
RAW_NOV2015 <- read.csv("Flights_NOV2015.csv", header = TRUE, stringsAsFactors = FALSE)

#Load the raw data from United States Department of Transportation
location_df <- read_csv("airport_location.csv")

Preparing the Flights_FEB2015.csv Data

Delayed Departures by Airline

#Extract records and Add a counter field "numflight"
FLT_FEB2015 <- RAW_FEB2015 %>%
  select(FlightDate,Carrier,Origin,Dest,DepDel15) %>% 
  filter(FlightDate == "2015-02-24") 

#Add airline names for carriers targeted for study
FLT_FEB2015 <- FLT_FEB2015 %>% mutate(Airline = NA) %>%
  mutate(Airline = ifelse(Carrier == "AA", "American", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "DL", "Delta", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "UA", "United", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "US", "USAir", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "WN", "Southwest", Airline))

#Isolate airline names for carriers targeted for study
FLT_FEB2015 <- FLT_FEB2015 %>%
  filter(Airline != "NA")

#Isolate flights with departure delays 15 minutes or more
DEL_FEB2015 <- FLT_FEB2015 %>%
  select(Airline,Origin,DepDel15) %>%
  filter(DepDel15 > 0)

#Create an Aggregate Table for Delayed Flights by Airline
AGG_ARLN_FEB2015 <- DEL_FEB2015 %>%
  select(Airline,DepDel15) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#View the Combined USAir and American Delay Data
pander(AGG_ARLN_FEB2015 %>% 
       select(Airline, `Delayed Departures` = DepDel15) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Airline Delays on February 24, 2015")

Airline Delays on February 24, 2015
Airline	Delayed Departures
American	275
Delta	1011
Southwest	573
United	251
USAir	266

#Plot the Combined USAir and American Delay Data
ggplot(
  (AGG_ARLN_FEB2015 %>% 
  select(Airline,DepDel15)),  
  aes(x=reorder(Airline,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Airline") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Major Airlines with Delays on February 24, 2015")

Delayed Departures by Originating Airport on February 24, 2015

#Create an Aggregate Table for Delayed Flights by Origin
#This table will be used later for popup data on map markers
AGG_ORIG_FEB2015 <- DEL_FEB2015 %>%
  select(Origin,DepDel15) %>% 
  group_by(Origin) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#Plot the Top 10 Airports with Departure Delays
ggplot(
  (AGG_ORIG_FEB2015 %>% 
  select(Origin,DepDel15)) %>% top_n(10),  
  aes(x=reorder(Origin,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Originating Airport") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Top 10 Airports with Departure Delays on February 24, 2015")

Atlanta (ATL) is a major hub for Delta Airlines.

Preparing the Flights_NOV2015.csv Data

Delayed Departures by Airline

#Extract records and Add a counter field "numflight"
FLT_NOV2015 <- RAW_NOV2015 %>%
  select(FlightDate,Carrier,Origin,Dest,Cancelled,DepDel15) %>% 
  filter(FlightDate == "2015-11-25")

#Add airline names for carriers targeted for study
FLT_NOV2015 <- FLT_NOV2015 %>% mutate(Airline = NA) %>%
  mutate(Airline = ifelse(Carrier == "AA", "American", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "DL", "Delta", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "UA", "United", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "US", "USAir", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "WN", "Southwest", Airline))

#Isolate airline names for carriers targeted for study
FLT_NOV2015 <- FLT_NOV2015 %>%
  filter(Airline != "NA")

#Isolate flights with departure delays 15 minutes or more
DEL_NOV2015 <- FLT_NOV2015 %>%
  select(Airline,Origin,Dest,DepDel15) %>%
  filter(DepDel15 > 0)

#Create an Aggregate Table for Delayed Flights by Airline
AGG_ARLN_NOV2015 <- DEL_NOV2015 %>%
  select(Airline,DepDel15) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#View the Combined USAir and American Delay Data
pander(AGG_ARLN_NOV2015 %>% 
       select(Airline, `Delayed Departures` = DepDel15) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Airline Delays on November 25, 2015")

Airline Delays on November 25, 2015
Airline	Delayed Departures
American	155
Delta	166
Southwest	312
United	141

#Plot the Combined USAir and American Delay Data
ggplot(
  (AGG_ARLN_NOV2015 %>% 
  select(Airline,DepDel15)),  
  aes(x=reorder(Airline,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Airline") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Major Airlines with Delays on November 25, 2015")

Delayed Departures by Originating Airport on November 25, 2015

#Create an Aggregate Table for Delayed Flights by Origin
#This table will be used later for popup data on map markers
AGG_ORIG_NOV2015 <- DEL_NOV2015 %>%
  select(Origin,DepDel15) %>% 
  group_by(Origin) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#Plot the Top 10 Airports with Departure Delays
ggplot(
  (AGG_ORIG_NOV2015 %>% 
  select(Origin,DepDel15)) %>% top_n(10),  
  aes(x=reorder(Origin,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Originating Airport") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Top 10 Airports with Departure Delays on November 25, 2015")

Los Angles (LAX) is the second busiest airport in the United States.

Preparing a Combined Airline Aggregate Delay Table

#Combine USAir and American Delay Data
AGG_AA_FEB2015 <- AGG_ARLN_FEB2015 %>% 
  mutate(Airline = ifelse(Airline == "US Airways","American",Airline))

#Aggregate the Combined USAir and American Delay Data
AGG_AA_FEB2015 <- AGG_AA_FEB2015 %>%
  select(Airline,DepDel15) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

AGG_ARLN_DEL <- inner_join(
  AGG_AA_FEB2015 %>% select(Airline,`Delays FEB 24, 2015`=DepDel15),
  AGG_ARLN_NOV2015 %>% select(Airline,`Delays NOV 24, 2015`=DepDel15),
  by="Airline") %>% 
  arrange(Airline)

#View the Combined USAir and American Delay Data
pander(AGG_ARLN_DEL,
       justify=c('left','center','center'),
       caption="Airline Delays with American and USAir Merged")

Airline Delays with American and USAir Merged
Airline	Delays FEB 24, 2015	Delays NOV 24, 2015
American	275	155
Delta	1011	166
Southwest	573	312
United	251	141

#Plot the Combined USAir and American Delay Data
AGG_ARLN_DEL %>% 
  ggvis(~`Delays FEB 24, 2015`, ~`Delays NOV 24, 2015`) %>% 
  layer_points(fill = ~factor(Airline), size := 600, opacity := .8) %>% 
  add_axis("x", title = "Number of Delayed Departures Recorded on FEB 24, 2015") %>% 
  add_axis("y", title = "Number of Delayed Departures Recorded on NOV 24, 2015")

American Airlines seems to have improved with the merger.

Preparing the Social Media Data

Loading The Twitter and Lexicon Data

#Load the raw twitter data downloaded from kaggle
T_RAW2015 <- read_csv("Airline_Tweets_24FEB2015.csv")

#Load the preserved twitter data from November 24, 2016
T_RAW2016 <- read_csv("Airline_Tweets_24NOV2016.csv")

#Create "bing" word dictionary for sentiment analysis
lex_bing <- sentiments %>% filter(lexicon == "bing")

#Create "finn" word dictionary for sentiment analysis
lex_finn <- sentiments %>% filter(lexicon == "AFINN")

Prepare the February 24, 2015 Twitter Data

#Create a list of words to drop
#This list targets "non-value-added" words
drop_words <- c("americanair",
                "american",
                "united",
                "usairways",
                "usair",
                "southwestair",
                "delta",
                "jetblue",
                "thanksgiving",
                "flight","fights",
                "airport","airports",
                "airline","airlines",
                "plane","planes",
                "fly","flying",
                "http","https",
                "lol","thx","hey",
                "hrs","air","fan",
                "skies",
                "wifi")

The “drop_words” list is subjective and can be tweaked as needed

#Select columns to match recently obtained data
TWT_2015 <- T_RAW2015 %>% select(airline,tweet_created,text,name) %>% 
  #grep only tweets that occurred on "2015-02-24"
  filter(grepl('2015-02-24', tweet_created)) %>% 
  filter(grepl('United|US Airways|Delta|Southwest|American', airline))

#Crate a regex pattern
reg <- "([^A-Za-z\\d']|'(?![A-Za-z\\d]))|[0-9]"

#Create a new data frame of key words
tweets_2015 <- TWT_2015 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  #Remove stop words and all two letter words
  filter(!word %in% stop_words$word,
         str_detect(word,"[a-z]{3,20}"))

#Remove unwanted words from "drop_words" vector  
tweets_2015 <- tweets_2015 %>% filter(!word %in% drop_words)

#Create a new aggregate data frame of key words
agg_words_2015 <- tweets_2015 %>% 
  group_by(word) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n))

#Join extracted key words with sentiment dictionaries
words_2015_bing <- tweets_2015 %>% inner_join(lex_bing, by="word")
words_2015_finn <- tweets_2015 %>% inner_join(lex_finn, by="word")

Comparison Cloud for February 24, 2015 Twitter Sentiment

words_2015_bing %>% 
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("darkred", "darkblue"),
                   max.words = 100)

The word “delayed” stands out and that is bad for sentiment

Bar Chart Comparison for February 24, 2015 Twitter Sentiment

words_2015_bing %>% 
  inner_join(lex_bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") +
  coord_flip() +  theme_few() + 
  ggtitle("Frequently Used Words on February 24, 2015")

“Delayed” is clearly the dominant word tweeted

Prepare the November 24, 2016 Twitter Data

#Select columns to match recently obtained data
TWT_2016 <- T_RAW2016 %>% select(airline,tweet_created,text,name) %>% 
  #grep only tweets that occurred on "2016-02-24"
  filter(grepl('2016-11-24', tweet_created))

#Create a new data frame 
tweets_2016 <- TWT_2016 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  #Remove stop words and all two letter words
  filter(!word %in% stop_words$word,
         str_detect(word,"[a-z]{3,20}"))
  
#Remove unwanted words from "drop_words" vector  
tweets_2016 <- tweets_2016 %>% filter(!word %in% drop_words)

#Create a new aggregate data frame of key words
agg_words_2016 <- tweets_2016 %>% 
  group_by(word) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n))

#Join extracted key words with sentiment dictionaries
words_2016_bing <- tweets_2016 %>% inner_join(lex_bing, by="word")
words_2016_finn <- tweets_2016 %>% inner_join(lex_finn, by="word")

Comparison Cloud for November 24, 2016 Twitter Sentiment

words_2016_bing %>% 
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("darkred", "darkgreen"),
                   max.words = 100)

The word “happy” stands out and that is good for sentiment

Bar Chart Comparison for November 24, 2016 Twitter Sentiment

words_2016_bing %>% 
  inner_join(lex_bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  filter(n > 40) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") +
  coord_flip() +  theme_few() + 
  ggtitle("Frequently Used Words on November 24, 2016")

Positive sentiment seems to greatly outweigh the negative sentiment

Overall Sentiment Comparison of Major Airlines on February 24, 2015

#Create an aggregate data frame of finn lexicon score
agg_air_2015 <- words_2015_finn %>%
  select(`Airline`=airline,score) %>% 
  group_by(Airline) %>% 
  summarize_each(funs(sum)) %>% 
  arrange(desc(score))

#View the finn lexicon score for each airline
pander(agg_air_2015 %>% 
       select(Airline, `Sentiment Score` = score) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Finn Lexicon Score on February 24, 2015")

Finn Lexicon Score on February 24, 2015
Airline	Sentiment Score
American	-200
Delta	-16
Southwest	-58
United	-50
US Airways	-88

Strong negative sentiment with a cumulative negative sum.

Major Airline Sentiment Score on February 24, 2015

ggplot(
  (agg_air_2015 %>%
  select(Airline,score)),  
  aes(x=reorder(Airline,score), y= score)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Airline") +
  ylab("Finn Lexicon Sentiment Score") + coord_flip() +
  theme_few() + ggtitle("Major Airlines Sentiment Score on February 24, 2015")

Overall sentiment was negative and American Airlines received the most negative tweets

Clearly if we combined US Airways and American, they would dwarf the others

Overall Sentiment Comparison of Major Airlines on November 24, 2016

#Create an aggregate data frame of finn lexicon score
agg_air_2016 <- words_2016_finn %>%
  select(`Airline`=airline,score) %>% 
  group_by(Airline) %>% 
  summarize_each(funs(sum)) %>% 
  arrange(desc(score))

#View the finn lexicon score for each airline
pander(agg_air_2016 %>% 
       select(Airline, `Sentiment Score` = score) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Finn Lexicon Score on November 24, 2016")

Finn Lexicon Score on November 24, 2016
Airline	Sentiment Score
American	307
Delta	272
Southwest	967
United	64

Strong negative sentiment with aggregate scores all less than zero.

Major Airline Sentiment Score on November 24, 2016

ggplot(
  (agg_air_2016 %>%
  select(Airline,score)),  
  aes(x=reorder(Airline,score), y= score)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Airline") +
  ylab("Finn Lexicon Sentiment Score") + coord_flip() +
  theme_few() + ggtitle("Major Airlines Sentiment Score on November 24, 2016")

It is interesting that the overall sentiment was positive.

Compare Positive vs. Negative Sentiment for major airlines for February 24, 2015

#Create data frame for delay percentages
p.airline_feb2015 <- FLT_FEB2015 %>% 
  select(Airline, DepDel15) %>% 
  filter(DepDel15 != "NA")

#Create aggregate data frame for delay percentages
p.airline_feb2015 <- p.airline_feb2015 %>% 
  select(Airline,DepDel15) %>% 
  mutate(count=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(Airline)

#View aggregate data before calculating percentage
pander(p.airline_feb2015 %>% select(Airline,`Delays`=DepDel15,`Flights`=count),
       justify=c('left','center','center'),
       caption="Values used to calculate delay percentages")

Values used to calculate delay percentages
Airline	Delays	Flights
American	275	1371
Delta	1011	2178
Southwest	573	3249
United	251	1331
USAir	266	1105

#Calculate the delay percentage
p.airline_feb2015 <- p.airline_feb2015 %>% 
  mutate(DepDel15 = round(100*(DepDel15/count)))

#Rename DepDel15 column
p.airline_feb2015 <- p.airline_feb2015 %>% 
  select(Airline,`Delay`=DepDel15)

#Create an aggregate data frame of positive bing lexicon
agg_bing_pos_2015 <- words_2015_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('positive',sentiment))

#Add a counter and summarize
agg_bing_pos_2015 <- agg_bing_pos_2015 %>% 
  select(Airline) %>%
  mutate(Positive=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Positive))

#Create an aggregate data frame of negative bing lexicon
agg_bing_neg_2015 <- words_2015_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('negative',sentiment))

#Add a counter and summarize
agg_bing_neg_2015 <- agg_bing_neg_2015 %>% 
  select(Airline) %>%
  mutate(Negative=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Negative))

#Join the bing lexicon tables
agg_bing_2015 <- inner_join(agg_bing_pos_2015,agg_bing_neg_2015,by="Airline") %>% 
  arrange(Airline)

#Calculated the Percentage of Positive vs. Negative words
agg_bing_2015 <- agg_bing_2015 %>% 
  select(Airline,Positive,Negative) %>% 
  mutate(perPositve = round(100 *(Positive/(Positive + Negative)))) %>% 
  mutate(perNegative = round(100 *(Negative/(Positive + Negative))))

#Join the bing lexicon tables
agg_feb2015 <- inner_join(agg_bing_2015,p.airline_feb2015,by="Airline") %>% 
  arrange(desc(Airline))

agg_percent_feb2015 <- agg_feb2015 %>% 
  select(Airline,Delay,`Positive`=perPositve,`Negative`=perNegative) %>% 
  arrange(desc(Airline))

#Melt the Comparison Variables  
agg_percent_feb2015.m <- melt(agg_percent_feb2015, id.vars='Airline')

ggplot(
  agg_percent_feb2015.m,
  aes(x=Airline, y=value, fill=variable)) +
  geom_bar(stat="identity", position=position_dodge()) +
  xlab("Airline") +
  ylab("Variable Percentage") + coord_flip() +
  theme_few() + ggtitle("Comparing Delay Percentage to Sentiment for February 24, 2015")

Sentiment does not appear to be directly correlated to delays

The following table shows the numbers used to calculate the delay percentages.

#View the Data Scraped from Wikipedia
pander(agg_bing_2015 %>% select(Airline,Positive,Negative),
       justify=c('left','center','center'),
       caption="Distribution of bing sentiment for Febrary 24, 2015")

Distribution of bing sentiment for Febrary 24, 2015
Airline	Positive	Negative
American	183	338
Delta	28	41
Southwest	59	99
United	64	125
US Airways	50	116

Compare Positive vs. Negative Sentiment for major airlines for November 24, 2016

#Create data frame for delay percentages
p.airline_nov2015 <- FLT_NOV2015 %>% 
  select(Airline, DepDel15) %>% 
  filter(DepDel15 != "NA")

#Create aggregate data frame for delay percentages
p.airline_nov2015 <- p.airline_nov2015 %>% 
  select(Airline,DepDel15) %>% 
  mutate(count=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(Airline)

#View aggregate data before calculating percentage
pander(p.airline_nov2015 %>% select(Airline,`Delays`=DepDel15,`Flights`=count),
       justify=c('left','center','center'),
       caption="Values used to calculate delay percentages")

Values used to calculate delay percentages
Airline	Delays	Flights
American	155	2594
Delta	166	2592
Southwest	312	3799
United	141	1536

#Calculate the delay percentage
p.airline_nov2015 <- p.airline_nov2015 %>% 
  mutate(DepDel15 = round(100*(DepDel15/count)))

#Rename DepDel15 column
p.airline_nov2015 <- p.airline_nov2015 %>% 
  select(Airline,`Delay`=DepDel15)

#Create an aggregate data frame of positive bing lexicon
agg_bing_pos_2016 <- words_2016_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('positive',sentiment))

#Add a counter and summarize
agg_bing_pos_2016 <- agg_bing_pos_2016 %>% 
  select(Airline) %>%
  mutate(Positive=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Positive))

#Create an aggregate data frame of negative bing lexicon
agg_bing_neg_2016 <- words_2016_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('negative',sentiment))

#Add a counter and summarize
agg_bing_neg_2016 <- agg_bing_neg_2016 %>% 
  select(Airline) %>%
  mutate(Negative=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Negative))

#Join the bing lexicon tables
agg_bing_2016 <- inner_join(agg_bing_pos_2016,agg_bing_neg_2016,by="Airline") %>% 
  arrange(Airline)

#Calculated the Percentage of Positive vs. Negative words
agg_bing_2016 <- agg_bing_2016 %>% 
  select(Airline,Positive,Negative) %>% 
  mutate(perPositve = round(100 *(Positive/(Positive + Negative)))) %>% 
  mutate(perNegative = round(100 *(Negative/(Positive + Negative))))

#Join the bing lexicon tables
agg_nov2016 <- inner_join(agg_bing_2016,p.airline_nov2015,by="Airline") %>% 
  arrange(desc(Airline))

agg_percent_nov2016 <- agg_nov2016 %>% 
  select(Airline,Delay,`Positive`=perPositve,`Negative`=perNegative) %>% 
  arrange(desc(Airline))

#Melt the Comparison Variables  
agg_percent_nov2016.m <- melt(agg_percent_nov2016, id.vars='Airline')

ggplot(
  agg_percent_nov2016.m,
  aes(x=Airline, y=value, fill=variable)) +
  geom_bar(stat="identity", position=position_dodge()) +
  xlab("Airline") +
  ylab("Variable Percentage") + coord_flip() +
  theme_few() + ggtitle("Comparing Delay Percentage to Sentiment for November 24, 2016")

This chart will be updated once the November 2016 data becomes available

It is difficult to determine if the overall positive sentiment is due to the lack of delayed departures or simply excitement for Thanksgiving holiday.

The perfect split between positive and negative for Delta is purely coincidental

The actual numbers were slightly different but once rounded as a percentage they become equal. The following table shows the numbers used to calculate the percentages.

#View the Data Scraped from Wikipedia
pander(agg_bing_2016 %>% select(Airline,Positive,Negative),
       justify=c('left','center','center'),
       caption="Distribution of bing sentiment for November 24, 2016")

Distribution of bing sentiment for November 24, 2016
Airline	Positive	Negative
American	436	465
Delta	447	449
Southwest	642	272
United	288	379

Scrape Top 30 Busiest Airports for Geocoding

#Read HTML Data
airports <- read_html("https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States")

rank <-  airports %>% html_nodes("td:nth-child(1)") %>% html_text()
name <-  airports %>% html_nodes("td:nth-child(2)") %>% html_text()
code <-  airports %>% html_nodes("td:nth-child(3)") %>% html_text()
city <- airports %>% html_nodes("td:nth-child(4)") %>% html_text()
state <- airports %>% html_nodes("td:nth-child(5)") %>% html_text()
boardings <- airports %>% html_nodes("td:nth-child(6)") %>% html_text()

After viewing the scraped data with list(), it became obvious the program selected a few extra lines of data.

A new data frame will be created using only the first 30 rows of each vector.

#Create new data frame from the first 30 rows in each vector
#rename column headings
hub_df <- data.frame(cbind(rank=rank[1:30],
                name=name[1:30],
                code=code[1:30],
                city=city[1:30],
                state=state[1:30],
                boardings=boardings[1:30]))

#Need to remove "/" characters for HTML rendering
#Wasn't having much luck so skipped city and name. 
#hub_df <- hub_df %>%
  #mutate(city = gsub("[/]", " ", city)) %>% 
  #mutate(city = gsub("[-]", " ", city)) %>% 
  #mutate(name = gsub("[/]", " ", name)) %>%
  #mutate(name = gsub("[-]", " ", name))

#View the Data Scraped from Wikipedia
pander(hub_df %>% select(`Rank`=rank,
                         `Airport`=code,
                         `Boardings`=boardings),
       justify=c('center','center','right'),
       caption="Top 30 Busiest Airport in the United States for 2015")

Top 30 Busiest Airport in the United States for 2015
Rank	Airport	Boardings
1	ATL	49,340,732
2	LAX	36,351,226
3	ORD	36,305,668
4	DFW	31,589,832
5	JFK	27,717,503
6	DEN	26,280,043
7	SFO	24,190,549
8	CLT	21,913,156
9	LAS	21,824,231
10	PHX	21,351,445
11	MIA	20,986,341
12	IAH	20,595,874
13	SEA	20,148,980
14	MCO	18,759,938
15	EWR	18,684,765
16	MSP	17,634,252
17	BOS	16,290,323
18	DTW	16,255,507
19	PHL	15,101,318
20	LGA	14,319,924
21	FLL	13,061,607
22	BWI	11,738,828
23	DCA	11,242,375
24	MDW	10,830,783
25	SLC	10,634,519
26	IAD	10,363,918
27	SAN	9,985,739
28	HNL	9,479,083
29	TPA	9,150,414
30	PDX	8,340,234

View the Twitter mentions of major airport on February 24, 2015

#Remove unwanted words from "drop_words" vector  
mention_2015 <- TWT_2015 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)

#Convert the twitter words to all upper case
mention_2015$word <- toupper(mention_2015$word)
  
#Create an Aggregate Table for Mentions by airport code
agg_mention_2015 <- mention_2015 %>%
  select(word) %>%
  mutate(count=1) %>% 
  filter(word %in% hub_df$code) %>% 
  group_by(word) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(count))

#View All Major Airport with Twitter Mentions
pander(agg_mention_2015 %>% 
       select(`Airport`= word, `Twitter Mentions` = count) %>% 
       arrange(Airport),
       justify=c('left','center'),
       caption="Airports with Twitter Mentions on February 24, 2015")

Airports with Twitter Mentions on February 24, 2015
Airport	Twitter Mentions
ATL	6
BOS	2
BWI	3
CLT	7
DCA	11
DFW	37
EWR	4
FLL	7
IAD	5
IAH	4
JFK	7
LAS	3
LAX	15
LGA	2
MCO	3
MIA	5
MSP	1
ORD	10
PDX	3
PHL	9
PHX	3
SAN	4
SFO	3
TPA	2

#Plot the Top 10 Major Airport with Twitter Mentions
ggplot(
  (agg_mention_2015 %>% 
  select(word,count)) %>% top_n(10),  
  aes(x=reorder(word,count), y= count)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Airport") +
  ylab("Number of Twitter Mentions") + coord_flip() +
  theme_few() + ggtitle("Top 10 Major Airports with Twitter Mentions on February 24, 2015")

View the Twitter mentions of major airport on November 24, 2016

#Remove unwanted words from "drop_words" vector  
mention_2016 <- TWT_2016 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)

#Convert the twitter words to all upper case
mention_2016$word <- toupper(mention_2016$word)
  
#Create an Aggregate Table for Mentions by airport code
agg_mention_2016 <- mention_2016 %>%
  select(word) %>%
  mutate(count=1) %>% 
  filter(word %in% hub_df$code) %>% 
  group_by(word) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(count))

#View All Major Airport with Twitter Mentions
pander(agg_mention_2016 %>% 
       select(`Airport`= word, `Twitter Mentions` = count) %>% 
       arrange(Airport),
       justify=c('left','center'),
       caption="Airports with Twitter Mentions on November 24, 2016")

Airports with Twitter Mentions on November 24, 2016
Airport	Twitter Mentions
ATL	18
BOS	5
BWI	12
CLT	6
DCA	38
DEN	7
DFW	21
DTW	6
EWR	16
FLL	1
HNL	2
IAD	9
IAH	2
JFK	18
LAS	13
LAX	28
LGA	11
MCO	5
MDW	7
MIA	5
MSP	6
ORD	27
PDX	4
PHL	8
PHX	5
SAN	27
SEA	4
SFO	22
SLC	4
TPA	1

#Plot the Top 10 Major Airports with Twitter Mentions
ggplot(
  (agg_mention_2016 %>%
  select(word,count)) %>% top_n(10), 
  aes(x=reorder(word,count), y= count)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Airport") +
  ylab("Number of Twitter Mentions") + coord_flip() +
  theme_few() + ggtitle("Top 10 Major Airport with Twitter Mentions on November 24, 2016")

Compare Social Media Airport Mentions with Origin Delays for February 24, 2015

#Create an Aggregate Table for Mentions by airport code
AGG_MAJOR_FEB2015 <- AGG_ORIG_FEB2015 %>%
  select("word"=Origin,DepDel15) %>%
  filter(word %in% hub_df$code) %>% 
  arrange(desc(DepDel15))

#Join Social Media Data with Flight Data
AGG_MAJOR_FEB2015 <- inner_join(AGG_MAJOR_FEB2015,agg_mention_2015,by="word")

#Rename the columns
AGG_MAJOR_FEB2015 <- AGG_MAJOR_FEB2015 %>% 
  select(`Origin`=word,`Delays`=DepDel15,`Mentions`=count)

#Plot the Combined USAir and American Delay Data
AGG_MAJOR_FEB2015 %>% 
  ggvis(~`Delays`, ~`Mentions`) %>% 
  layer_points(fill = ~factor(Origin), size := 600, opacity := .8) %>% 
  add_axis("y", title = "Number of Twitter Mentions Recorded on FEB 24, 2015") %>% 
  add_axis("x", title = "Number of Delayed Departures Recorded on FEB 24, 2015")

No apparent correlation between delays and twitter mentions

It is difficult to discern the difference in colors but the outliers are ATL for delayed departures and DFW for Twitter mentions.

Compare Social Media Airport Mentions with Origin Delay Data for November 24, 2016

#Create an Aggregate Table for Mentions by airport code
AGG_MAJOR_NOV2016 <- AGG_ORIG_NOV2015 %>%
  select("word"=Origin,DepDel15) %>%
  filter(word %in% hub_df$code) %>% 
  arrange(desc(DepDel15))

#Join Social Media Data with Flight Data
AGG_MAJOR_NOV2016 <- inner_join(AGG_MAJOR_NOV2016,agg_mention_2016,by="word")

#Rename the columns
AGG_MAJOR_NOV2016 <- AGG_MAJOR_NOV2016 %>% 
  select(`Origin`=word,`Delays`=DepDel15,`Mentions`=count)

#Plot the Combined USAir and American Delay Data
AGG_MAJOR_NOV2016 %>% 
  ggvis(~`Delays`, ~`Mentions`) %>% 
  layer_points(fill = ~factor(Origin), size := 600, opacity := .8) %>% 
  add_axis("y", title = "Number of Twitter Mentions Recorded on NOV 24, 2016") %>% 
  add_axis("x", title = "Number of Delayed Departures Recorded on NOV 24, 2016")

This chart will be updated once the November 2016 data becomes available

Prepare Airport Location Data and Merge with Scraped Data

#Select the useful fields from the imported data
location_df <- location_df %>%
  select(code = AIRPORT,lat = LATITUDE,lon = LONGITUDE)

#Merge the imported data with the scraped data
geo_df <- merge(hub_df,  location_df, by = intersect("code", "code"))

#There were duplicate rows in airport_location.csv
#with different lat & lon values.
#Need to remove duplicate rows
geo_df <- geo_df[!duplicated(geo_df$code), ]

The data is now ready to be placed on a map

Location of Top 30 Busiest Airports in the USA for 2015

#Create popup meta data
hub_popup <- paste(
                    "<br><strong>Airport Name:</strong>",geo_df$name,
                    "<br><strong>Three Letter Code: </strong>",geo_df$code,
                    "<br><strong>City of Service: </strong>",geo_df$city,
                    "<br><strong>State of Service: </strong>",geo_df$state,
                    "<br><strong>Total Boardings in 2015: </strong>",geo_df$boardings,
                    "<br><strong>Rank by Total Boardings in 2015: </strong>",geo_df$rank,
                    "<br><strong>February 24, 2015 Departure Delays: </strong>",AGG_ORIG_FEB2015$DepDel15,
                    "<br><strong>February 24, 2015 Twitter Mentions: </strong>",agg_mention_2015$count,
                    "<br><strong>November 24, 2016 Departure Delays: </strong>",AGG_ORIG_NOV2015$DepDel15,
                    "<br><strong>November 24, 2016 Twitter Mentions: </strong>",agg_mention_2016$count
                   )

#Create a Map and Insert Markers from ptld_df
leaflet(data = geo_df) %>% addTiles() %>%
  setView(lng = -98.58, lat = 39.82, zoom = 4) %>%
  #Insert Third Party Tiles
  addProviderTiles("Esri.DeLorme") %>% 
  #Add Markers and Pop-up Data
  addMarkers(~lon, ~lat, popup = ~hub_popup)

Click on one of the markers to view airport data and twitter mentions

Domestic Air Travel in the United States

Carl Lolar

December 9, 2016

Airport	Twitter Mentions
ATL	6
BOS	2
BWI	3
CLT	7
DCA	11
DFW	37
EWR	4
FLL	7
IAD	5
IAH	4
JFK	7
LAS	3
LAX	15
LGA	2
MCO	3
MIA	5
MSP	1
ORD	10
PDX	3
PHL	9
PHX	3
SAN	4
SFO	3
TPA	2

Airport	Twitter Mentions
ATL	18
BOS	5
BWI	12
CLT	6
DCA	38
DEN	7
DFW	21
DTW	6
EWR	16
FLL	1
HNL	2
IAD	9
IAH	2
JFK	18
LAS	13
LAX	28
LGA	11
MCO	5
MDW	7
MIA	5
MSP	6
ORD	27
PDX	4
PHL	8
PHX	5
SAN	27
SEA	4
SFO	22
SLC	4
TPA	1

Airport	Twitter Mentions
ATL	6
BOS	2
BWI	3
CLT	7
DCA	11
DFW	37
EWR	4
FLL	7
IAD	5
IAH	4
JFK	7
LAS	3
LAX	15
LGA	2
MCO	3
MIA	5
MSP	1
ORD	10
PDX	3
PHL	9
PHX	3
SAN	4
SFO	3
TPA	2

Airport	Twitter Mentions
ATL	18
BOS	5
BWI	12
CLT	6
DCA	38
DEN	7
DFW	21
DTW	6
EWR	16
FLL	1
HNL	2
IAD	9
IAH	2
JFK	18
LAS	13
LAX	28
LGA	11
MCO	5
MDW	7
MIA	5
MSP	6
ORD	27
PDX	4
PHL	8
PHX	5
SAN	27
SEA	4
SFO	22
SLC	4
TPA	1

Domestic Air Travel in the United States

Carl Lolar

December 9, 2016

Comparing Flight Data and Social Media Sentiment

Data Files

Loading The Raw Data

Preparing the Flights_FEB2015.csv Data

Delayed Departures by Airline

Delayed Departures by Originating Airport on February 24, 2015

Preparing the Flights_NOV2015.csv Data

Delayed Departures by Airline

Delayed Departures by Originating Airport on November 25, 2015

Preparing a Combined Airline Aggregate Delay Table

Preparing the Social Media Data

Loading The Twitter and Lexicon Data

Prepare the February 24, 2015 Twitter Data

Comparison Cloud for February 24, 2015 Twitter Sentiment

Bar Chart Comparison for February 24, 2015 Twitter Sentiment

Prepare the November 24, 2016 Twitter Data

Comparison Cloud for November 24, 2016 Twitter Sentiment

Bar Chart Comparison for November 24, 2016 Twitter Sentiment

Overall Sentiment Comparison of Major Airlines on February 24, 2015

Major Airline Sentiment Score on February 24, 2015

Overall Sentiment Comparison of Major Airlines on November 24, 2016

Major Airline Sentiment Score on November 24, 2016

Compare Positive vs. Negative Sentiment for major airlines for February 24, 2015

Compare Positive vs. Negative Sentiment for major airlines for November 24, 2016

Scrape Top 30 Busiest Airports for Geocoding

View the Twitter mentions of major airport on February 24, 2015

View the Twitter mentions of major airport on November 24, 2016

Compare Social Media Airport Mentions with Origin Delays for February 24, 2015

Compare Social Media Airport Mentions with Origin Delay Data for November 24, 2016

Prepare Airport Location Data and Merge with Scraped Data

Location of Top 30 Busiest Airports in the USA for 2015

Note: Honolulu (HNL) is ranked in the top 30 busiest airports but does not show in the initial map.

Thank you for your time!

Airport	Twitter Mentions
ATL	6
BOS	2
BWI	3
CLT	7
DCA	11
DFW	37
EWR	4
FLL	7
IAD	5
IAH	4
JFK	7
LAS	3
LAX	15
LGA	2
MCO	3
MIA	5
MSP	1
ORD	10
PDX	3
PHL	9
PHX	3
SAN	4
SFO	3
TPA	2

Airport	Twitter Mentions
ATL	18
BOS	5
BWI	12
CLT	6
DCA	38
DEN	7
DFW	21
DTW	6
EWR	16
FLL	1
HNL	2
IAD	9
IAH	2
JFK	18
LAS	13
LAX	28
LGA	11
MCO	5
MDW	7
MIA	5
MSP	6
ORD	27
PDX	4
PHL	8
PHX	5
SAN	27
SEA	4
SFO	22
SLC	4
TPA	1