Comparing Flight Data and Social Media Sentiment

 

This intent of this report to compare metrics for domestic flights originating in the United States on February 24, 2015 with similar flights originating on November 23, 2016. The fight data was obtained from the United States Department of Transportation Website. Unfortunately, the November 2016 flight data will not be available until January 2017 so the assumption was made that flight data for the day before the Thanksgiving holiday in 2015 could be substituted for the 2016 flight data until the November 2016 data becomes available.

The flight data will be compared to social media activity on Twitter occurring on February 24, 2015 and November 23, 2016 so the correlation between the November 2015 flight data and the November 2016 social media data is for demonstration purposes only until the November 2016 flight data becomes available.

On December 9, 2013, American Airlines and US Airways merged to form American Airlines Group. Each Airline continues to operate independently through a planned integration period. A data set containing social media activity regarding domestic air travel in the United States on February 24,2015 was Obtained from Kaggle and is used to conduct a sentiment analysis of airline service during planned integration time period. The assumption was made that the combination of American Airlines and Us Airways data can be directly compared to the post merge data.

On October 17, 2015, the American and US Airways merger was finalized and the integration process was considered complete. This supports the assumption of being able to use the November 2015 flight data in place of the November 2016 flight data because American Airlines and US Airways were operating as one single airline under the American Airlines name.

To determine of the busiest airports in the United States in 2015, Wikipedia was sourced to scrape the List of the busiest airports in the United States. The Selector Gadget extension for Google Chrome was utilized to select the CSS data from the Wikipedia page.

To obtain latitude and longitude information, all airport codes and their location were downloaded from United States Department of Transportation

The most recent twitter data was obtained on November 24, 2016 and saved in .csv format to preserve the data. The code used to download the social media data from Twitter can be found in a separate report published to RPubs here: airline_tweets.html.

Note: If you have trouble with any of the links, try right clicking and selecting “open in a new window (or tab)”. You may also need to download the “airline_tweets.html” file and open it with a browser to view it properly.

 


Data Files

Note: The link to the data works fine in the knit .html version but does not open in RPubs.

To view the data in RPubs: right-click the link and select “open link in a new tab” or download the files individually and open them from your PC with the appropriate software.

Airline_Tweets_24FEB2015.csv

Airline_Tweets_24NOV2016.csv

Flights_NOV2015.csv

Flights_SEP2016.csv

airport_location.csv

 


Loading The Raw Data

#Load the Raw Flight Data from February 2015
RAW_FEB2015 <- read.csv("Flights_FEB2015.csv", header = TRUE, stringsAsFactors = FALSE)

#Load the Raw Flight Data from November 2015
RAW_NOV2015 <- read.csv("Flights_NOV2015.csv", header = TRUE, stringsAsFactors = FALSE)

#Load the raw data from United States Department of Transportation
location_df <- read_csv("airport_location.csv")

 


Preparing the Flights_FEB2015.csv Data

Delayed Departures by Airline

#Extract records and Add a counter field "numflight"
FLT_FEB2015 <- RAW_FEB2015 %>%
  select(FlightDate,Carrier,Origin,Dest,DepDel15) %>% 
  filter(FlightDate == "2015-02-24") 

#Add airline names for carriers targeted for study
FLT_FEB2015 <- FLT_FEB2015 %>% mutate(Airline = NA) %>%
  mutate(Airline = ifelse(Carrier == "AA", "American", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "DL", "Delta", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "UA", "United", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "US", "USAir", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "WN", "Southwest", Airline))

#Isolate airline names for carriers targeted for study
FLT_FEB2015 <- FLT_FEB2015 %>%
  filter(Airline != "NA")

#Isolate flights with departure delays 15 minutes or more
DEL_FEB2015 <- FLT_FEB2015 %>%
  select(Airline,Origin,DepDel15) %>%
  filter(DepDel15 > 0)

#Create an Aggregate Table for Delayed Flights by Airline
AGG_ARLN_FEB2015 <- DEL_FEB2015 %>%
  select(Airline,DepDel15) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#View the Combined USAir and American Delay Data
pander(AGG_ARLN_FEB2015 %>% 
       select(Airline, `Delayed Departures` = DepDel15) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Airline Delays on February 24, 2015")
Airline Delays on February 24, 2015
Airline Delayed Departures
American 275
Delta 1011
Southwest 573
United 251
USAir 266
#Plot the Combined USAir and American Delay Data
ggplot(
  (AGG_ARLN_FEB2015 %>% 
  select(Airline,DepDel15)),  
  aes(x=reorder(Airline,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Airline") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Major Airlines with Delays on February 24, 2015")

 


Delayed Departures by Originating Airport on February 24, 2015

#Create an Aggregate Table for Delayed Flights by Origin
#This table will be used later for popup data on map markers
AGG_ORIG_FEB2015 <- DEL_FEB2015 %>%
  select(Origin,DepDel15) %>% 
  group_by(Origin) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#Plot the Top 10 Airports with Departure Delays
ggplot(
  (AGG_ORIG_FEB2015 %>% 
  select(Origin,DepDel15)) %>% top_n(10),  
  aes(x=reorder(Origin,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Originating Airport") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Top 10 Airports with Departure Delays on February 24, 2015")

Atlanta (ATL) is a major hub for Delta Airlines.

 


Preparing the Flights_NOV2015.csv Data

Delayed Departures by Airline

#Extract records and Add a counter field "numflight"
FLT_NOV2015 <- RAW_NOV2015 %>%
  select(FlightDate,Carrier,Origin,Dest,Cancelled,DepDel15) %>% 
  filter(FlightDate == "2015-11-25")

#Add airline names for carriers targeted for study
FLT_NOV2015 <- FLT_NOV2015 %>% mutate(Airline = NA) %>%
  mutate(Airline = ifelse(Carrier == "AA", "American", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "DL", "Delta", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "UA", "United", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "US", "USAir", Airline)) %>%
  mutate(Airline = ifelse(Carrier == "WN", "Southwest", Airline))

#Isolate airline names for carriers targeted for study
FLT_NOV2015 <- FLT_NOV2015 %>%
  filter(Airline != "NA")

#Isolate flights with departure delays 15 minutes or more
DEL_NOV2015 <- FLT_NOV2015 %>%
  select(Airline,Origin,Dest,DepDel15) %>%
  filter(DepDel15 > 0)

#Create an Aggregate Table for Delayed Flights by Airline
AGG_ARLN_NOV2015 <- DEL_NOV2015 %>%
  select(Airline,DepDel15) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#View the Combined USAir and American Delay Data
pander(AGG_ARLN_NOV2015 %>% 
       select(Airline, `Delayed Departures` = DepDel15) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Airline Delays on November 25, 2015")
Airline Delays on November 25, 2015
Airline Delayed Departures
American 155
Delta 166
Southwest 312
United 141
#Plot the Combined USAir and American Delay Data
ggplot(
  (AGG_ARLN_NOV2015 %>% 
  select(Airline,DepDel15)),  
  aes(x=reorder(Airline,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Airline") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Major Airlines with Delays on November 25, 2015")

 


Delayed Departures by Originating Airport on November 25, 2015

#Create an Aggregate Table for Delayed Flights by Origin
#This table will be used later for popup data on map markers
AGG_ORIG_NOV2015 <- DEL_NOV2015 %>%
  select(Origin,DepDel15) %>% 
  group_by(Origin) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

#Plot the Top 10 Airports with Departure Delays
ggplot(
  (AGG_ORIG_NOV2015 %>% 
  select(Origin,DepDel15)) %>% top_n(10),  
  aes(x=reorder(Origin,DepDel15), y= DepDel15)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Originating Airport") +
  ylab("Delayed Departures") + coord_flip() +
  theme_few() + ggtitle("Top 10 Airports with Departure Delays on November 25, 2015")

Los Angles (LAX) is the second busiest airport in the United States.

 


Preparing a Combined Airline Aggregate Delay Table

#Combine USAir and American Delay Data
AGG_AA_FEB2015 <- AGG_ARLN_FEB2015 %>% 
  mutate(Airline = ifelse(Airline == "US Airways","American",Airline))

#Aggregate the Combined USAir and American Delay Data
AGG_AA_FEB2015 <- AGG_AA_FEB2015 %>%
  select(Airline,DepDel15) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(DepDel15))

AGG_ARLN_DEL <- inner_join(
  AGG_AA_FEB2015 %>% select(Airline,`Delays FEB 24, 2015`=DepDel15),
  AGG_ARLN_NOV2015 %>% select(Airline,`Delays NOV 24, 2015`=DepDel15),
  by="Airline") %>% 
  arrange(Airline)

#View the Combined USAir and American Delay Data
pander(AGG_ARLN_DEL,
       justify=c('left','center','center'),
       caption="Airline Delays with American and USAir Merged")
Airline Delays with American and USAir Merged
Airline Delays FEB 24, 2015 Delays NOV 24, 2015
American 275 155
Delta 1011 166
Southwest 573 312
United 251 141
#Plot the Combined USAir and American Delay Data
AGG_ARLN_DEL %>% 
  ggvis(~`Delays FEB 24, 2015`, ~`Delays NOV 24, 2015`) %>% 
  layer_points(fill = ~factor(Airline), size := 600, opacity := .8) %>% 
  add_axis("x", title = "Number of Delayed Departures Recorded on FEB 24, 2015") %>% 
  add_axis("y", title = "Number of Delayed Departures Recorded on NOV 24, 2015")

American Airlines seems to have improved with the merger.

 


Preparing the Social Media Data

Loading The Twitter and Lexicon Data

#Load the raw twitter data downloaded from kaggle
T_RAW2015 <- read_csv("Airline_Tweets_24FEB2015.csv")

#Load the preserved twitter data from November 24, 2016
T_RAW2016 <- read_csv("Airline_Tweets_24NOV2016.csv")

#Create "bing" word dictionary for sentiment analysis
lex_bing <- sentiments %>% filter(lexicon == "bing")

#Create "finn" word dictionary for sentiment analysis
lex_finn <- sentiments %>% filter(lexicon == "AFINN")

 


Prepare the February 24, 2015 Twitter Data

#Create a list of words to drop
#This list targets "non-value-added" words
drop_words <- c("americanair",
                "american",
                "united",
                "usairways",
                "usair",
                "southwestair",
                "delta",
                "jetblue",
                "thanksgiving",
                "flight","fights",
                "airport","airports",
                "airline","airlines",
                "plane","planes",
                "fly","flying",
                "http","https",
                "lol","thx","hey",
                "hrs","air","fan",
                "skies",
                "wifi")

The “drop_words” list is subjective and can be tweaked as needed

 

#Select columns to match recently obtained data
TWT_2015 <- T_RAW2015 %>% select(airline,tweet_created,text,name) %>% 
  #grep only tweets that occurred on "2015-02-24"
  filter(grepl('2015-02-24', tweet_created)) %>% 
  filter(grepl('United|US Airways|Delta|Southwest|American', airline))

#Crate a regex pattern
reg <- "([^A-Za-z\\d']|'(?![A-Za-z\\d]))|[0-9]"

#Create a new data frame of key words
tweets_2015 <- TWT_2015 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  #Remove stop words and all two letter words
  filter(!word %in% stop_words$word,
         str_detect(word,"[a-z]{3,20}"))

#Remove unwanted words from "drop_words" vector  
tweets_2015 <- tweets_2015 %>% filter(!word %in% drop_words)

#Create a new aggregate data frame of key words
agg_words_2015 <- tweets_2015 %>% 
  group_by(word) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n))

#Join extracted key words with sentiment dictionaries
words_2015_bing <- tweets_2015 %>% inner_join(lex_bing, by="word")
words_2015_finn <- tweets_2015 %>% inner_join(lex_finn, by="word")

 


Comparison Cloud for February 24, 2015 Twitter Sentiment

words_2015_bing %>% 
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("darkred", "darkblue"),
                   max.words = 100)

 

The word “delayed” stands out and that is bad for sentiment

 


Bar Chart Comparison for February 24, 2015 Twitter Sentiment

words_2015_bing %>% 
  inner_join(lex_bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") +
  coord_flip() +  theme_few() + 
  ggtitle("Frequently Used Words on February 24, 2015")

“Delayed” is clearly the dominant word tweeted


Prepare the November 24, 2016 Twitter Data

#Select columns to match recently obtained data
TWT_2016 <- T_RAW2016 %>% select(airline,tweet_created,text,name) %>% 
  #grep only tweets that occurred on "2016-02-24"
  filter(grepl('2016-11-24', tweet_created))

#Create a new data frame 
tweets_2016 <- TWT_2016 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  #Remove stop words and all two letter words
  filter(!word %in% stop_words$word,
         str_detect(word,"[a-z]{3,20}"))
  
#Remove unwanted words from "drop_words" vector  
tweets_2016 <- tweets_2016 %>% filter(!word %in% drop_words)

#Create a new aggregate data frame of key words
agg_words_2016 <- tweets_2016 %>% 
  group_by(word) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n))

#Join extracted key words with sentiment dictionaries
words_2016_bing <- tweets_2016 %>% inner_join(lex_bing, by="word")
words_2016_finn <- tweets_2016 %>% inner_join(lex_finn, by="word")

 


Comparison Cloud for November 24, 2016 Twitter Sentiment

words_2016_bing %>% 
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("darkred", "darkgreen"),
                   max.words = 100)

The word “happy” stands out and that is good for sentiment

 


Bar Chart Comparison for November 24, 2016 Twitter Sentiment

words_2016_bing %>% 
  inner_join(lex_bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  filter(n > 40) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") +
  coord_flip() +  theme_few() + 
  ggtitle("Frequently Used Words on November 24, 2016")

Positive sentiment seems to greatly outweigh the negative sentiment

 


Overall Sentiment Comparison of Major Airlines on February 24, 2015

#Create an aggregate data frame of finn lexicon score
agg_air_2015 <- words_2015_finn %>%
  select(`Airline`=airline,score) %>% 
  group_by(Airline) %>% 
  summarize_each(funs(sum)) %>% 
  arrange(desc(score))

#View the finn lexicon score for each airline
pander(agg_air_2015 %>% 
       select(Airline, `Sentiment Score` = score) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Finn Lexicon Score on February 24, 2015")
Finn Lexicon Score on February 24, 2015
Airline Sentiment Score
American -200
Delta -16
Southwest -58
United -50
US Airways -88

Strong negative sentiment with a cumulative negative sum.

 


Major Airline Sentiment Score on February 24, 2015

ggplot(
  (agg_air_2015 %>%
  select(Airline,score)),  
  aes(x=reorder(Airline,score), y= score)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Airline") +
  ylab("Finn Lexicon Sentiment Score") + coord_flip() +
  theme_few() + ggtitle("Major Airlines Sentiment Score on February 24, 2015")

Overall sentiment was negative and American Airlines received the most negative tweets

Clearly if we combined US Airways and American, they would dwarf the others

 


Overall Sentiment Comparison of Major Airlines on November 24, 2016

#Create an aggregate data frame of finn lexicon score
agg_air_2016 <- words_2016_finn %>%
  select(`Airline`=airline,score) %>% 
  group_by(Airline) %>% 
  summarize_each(funs(sum)) %>% 
  arrange(desc(score))

#View the finn lexicon score for each airline
pander(agg_air_2016 %>% 
       select(Airline, `Sentiment Score` = score) %>% 
       arrange(Airline),
       justify=c('left','center'),
       caption="Finn Lexicon Score on November 24, 2016")
Finn Lexicon Score on November 24, 2016
Airline Sentiment Score
American 307
Delta 272
Southwest 967
United 64

Strong negative sentiment with aggregate scores all less than zero.

 


Major Airline Sentiment Score on November 24, 2016

ggplot(
  (agg_air_2016 %>%
  select(Airline,score)),  
  aes(x=reorder(Airline,score), y= score)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Airline") +
  ylab("Finn Lexicon Sentiment Score") + coord_flip() +
  theme_few() + ggtitle("Major Airlines Sentiment Score on November 24, 2016")

It is interesting that the overall sentiment was positive.

 


Compare Positive vs. Negative Sentiment for major airlines for February 24, 2015

#Create data frame for delay percentages
p.airline_feb2015 <- FLT_FEB2015 %>% 
  select(Airline, DepDel15) %>% 
  filter(DepDel15 != "NA")

#Create aggregate data frame for delay percentages
p.airline_feb2015 <- p.airline_feb2015 %>% 
  select(Airline,DepDel15) %>% 
  mutate(count=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(Airline)

#View aggregate data before calculating percentage
pander(p.airline_feb2015 %>% select(Airline,`Delays`=DepDel15,`Flights`=count),
       justify=c('left','center','center'),
       caption="Values used to calculate delay percentages")
Values used to calculate delay percentages
Airline Delays Flights
American 275 1371
Delta 1011 2178
Southwest 573 3249
United 251 1331
USAir 266 1105
#Calculate the delay percentage
p.airline_feb2015 <- p.airline_feb2015 %>% 
  mutate(DepDel15 = round(100*(DepDel15/count)))

#Rename DepDel15 column
p.airline_feb2015 <- p.airline_feb2015 %>% 
  select(Airline,`Delay`=DepDel15)

#Create an aggregate data frame of positive bing lexicon
agg_bing_pos_2015 <- words_2015_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('positive',sentiment))

#Add a counter and summarize
agg_bing_pos_2015 <- agg_bing_pos_2015 %>% 
  select(Airline) %>%
  mutate(Positive=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Positive))

#Create an aggregate data frame of negative bing lexicon
agg_bing_neg_2015 <- words_2015_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('negative',sentiment))

#Add a counter and summarize
agg_bing_neg_2015 <- agg_bing_neg_2015 %>% 
  select(Airline) %>%
  mutate(Negative=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Negative))

#Join the bing lexicon tables
agg_bing_2015 <- inner_join(agg_bing_pos_2015,agg_bing_neg_2015,by="Airline") %>% 
  arrange(Airline)

#Calculated the Percentage of Positive vs. Negative words
agg_bing_2015 <- agg_bing_2015 %>% 
  select(Airline,Positive,Negative) %>% 
  mutate(perPositve = round(100 *(Positive/(Positive + Negative)))) %>% 
  mutate(perNegative = round(100 *(Negative/(Positive + Negative))))

#Join the bing lexicon tables
agg_feb2015 <- inner_join(agg_bing_2015,p.airline_feb2015,by="Airline") %>% 
  arrange(desc(Airline))

agg_percent_feb2015 <- agg_feb2015 %>% 
  select(Airline,Delay,`Positive`=perPositve,`Negative`=perNegative) %>% 
  arrange(desc(Airline))

#Melt the Comparison Variables  
agg_percent_feb2015.m <- melt(agg_percent_feb2015, id.vars='Airline')

ggplot(
  agg_percent_feb2015.m,
  aes(x=Airline, y=value, fill=variable)) +
  geom_bar(stat="identity", position=position_dodge()) +
  xlab("Airline") +
  ylab("Variable Percentage") + coord_flip() +
  theme_few() + ggtitle("Comparing Delay Percentage to Sentiment for February 24, 2015")

Sentiment does not appear to be directly correlated to delays

The following table shows the numbers used to calculate the delay percentages.

#View the Data Scraped from Wikipedia
pander(agg_bing_2015 %>% select(Airline,Positive,Negative),
       justify=c('left','center','center'),
       caption="Distribution of bing sentiment for Febrary 24, 2015")
Distribution of bing sentiment for Febrary 24, 2015
Airline Positive Negative
American 183 338
Delta 28 41
Southwest 59 99
United 64 125
US Airways 50 116

 


Compare Positive vs. Negative Sentiment for major airlines for November 24, 2016

#Create data frame for delay percentages
p.airline_nov2015 <- FLT_NOV2015 %>% 
  select(Airline, DepDel15) %>% 
  filter(DepDel15 != "NA")

#Create aggregate data frame for delay percentages
p.airline_nov2015 <- p.airline_nov2015 %>% 
  select(Airline,DepDel15) %>% 
  mutate(count=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(Airline)

#View aggregate data before calculating percentage
pander(p.airline_nov2015 %>% select(Airline,`Delays`=DepDel15,`Flights`=count),
       justify=c('left','center','center'),
       caption="Values used to calculate delay percentages")
Values used to calculate delay percentages
Airline Delays Flights
American 155 2594
Delta 166 2592
Southwest 312 3799
United 141 1536
#Calculate the delay percentage
p.airline_nov2015 <- p.airline_nov2015 %>% 
  mutate(DepDel15 = round(100*(DepDel15/count)))

#Rename DepDel15 column
p.airline_nov2015 <- p.airline_nov2015 %>% 
  select(Airline,`Delay`=DepDel15)

#Create an aggregate data frame of positive bing lexicon
agg_bing_pos_2016 <- words_2016_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('positive',sentiment))

#Add a counter and summarize
agg_bing_pos_2016 <- agg_bing_pos_2016 %>% 
  select(Airline) %>%
  mutate(Positive=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Positive))

#Create an aggregate data frame of negative bing lexicon
agg_bing_neg_2016 <- words_2016_bing %>% 
  select("Airline" = airline,sentiment) %>% 
  filter(grepl('negative',sentiment))

#Add a counter and summarize
agg_bing_neg_2016 <- agg_bing_neg_2016 %>% 
  select(Airline) %>%
  mutate(Negative=1) %>% 
  group_by(Airline) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(Negative))

#Join the bing lexicon tables
agg_bing_2016 <- inner_join(agg_bing_pos_2016,agg_bing_neg_2016,by="Airline") %>% 
  arrange(Airline)

#Calculated the Percentage of Positive vs. Negative words
agg_bing_2016 <- agg_bing_2016 %>% 
  select(Airline,Positive,Negative) %>% 
  mutate(perPositve = round(100 *(Positive/(Positive + Negative)))) %>% 
  mutate(perNegative = round(100 *(Negative/(Positive + Negative))))

#Join the bing lexicon tables
agg_nov2016 <- inner_join(agg_bing_2016,p.airline_nov2015,by="Airline") %>% 
  arrange(desc(Airline))

agg_percent_nov2016 <- agg_nov2016 %>% 
  select(Airline,Delay,`Positive`=perPositve,`Negative`=perNegative) %>% 
  arrange(desc(Airline))

#Melt the Comparison Variables  
agg_percent_nov2016.m <- melt(agg_percent_nov2016, id.vars='Airline')

ggplot(
  agg_percent_nov2016.m,
  aes(x=Airline, y=value, fill=variable)) +
  geom_bar(stat="identity", position=position_dodge()) +
  xlab("Airline") +
  ylab("Variable Percentage") + coord_flip() +
  theme_few() + ggtitle("Comparing Delay Percentage to Sentiment for November 24, 2016")

This chart will be updated once the November 2016 data becomes available

It is difficult to determine if the overall positive sentiment is due to the lack of delayed departures or simply excitement for Thanksgiving holiday.

The perfect split between positive and negative for Delta is purely coincidental

The actual numbers were slightly different but once rounded as a percentage they become equal. The following table shows the numbers used to calculate the percentages.

#View the Data Scraped from Wikipedia
pander(agg_bing_2016 %>% select(Airline,Positive,Negative),
       justify=c('left','center','center'),
       caption="Distribution of bing sentiment for November 24, 2016")
Distribution of bing sentiment for November 24, 2016
Airline Positive Negative
American 436 465
Delta 447 449
Southwest 642 272
United 288 379

 


Scrape Top 30 Busiest Airports for Geocoding

#Read HTML Data
airports <- read_html("https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States")

rank <-  airports %>% html_nodes("td:nth-child(1)") %>% html_text()
name <-  airports %>% html_nodes("td:nth-child(2)") %>% html_text()
code <-  airports %>% html_nodes("td:nth-child(3)") %>% html_text()
city <- airports %>% html_nodes("td:nth-child(4)") %>% html_text()
state <- airports %>% html_nodes("td:nth-child(5)") %>% html_text()
boardings <- airports %>% html_nodes("td:nth-child(6)") %>% html_text()

After viewing the scraped data with list(), it became obvious the program selected a few extra lines of data.

A new data frame will be created using only the first 30 rows of each vector.

#Create new data frame from the first 30 rows in each vector
#rename column headings
hub_df <- data.frame(cbind(rank=rank[1:30],
                name=name[1:30],
                code=code[1:30],
                city=city[1:30],
                state=state[1:30],
                boardings=boardings[1:30]))

#Need to remove "/" characters for HTML rendering
#Wasn't having much luck so skipped city and name. 
#hub_df <- hub_df %>%
  #mutate(city = gsub("[/]", " ", city)) %>% 
  #mutate(city = gsub("[-]", " ", city)) %>% 
  #mutate(name = gsub("[/]", " ", name)) %>%
  #mutate(name = gsub("[-]", " ", name))

#View the Data Scraped from Wikipedia
pander(hub_df %>% select(`Rank`=rank,
                         `Airport`=code,
                         `Boardings`=boardings),
       justify=c('center','center','right'),
       caption="Top 30 Busiest Airport in the United States for 2015")
Top 30 Busiest Airport in the United States for 2015
Rank Airport Boardings
1 ATL 49,340,732
2 LAX 36,351,226
3 ORD 36,305,668
4 DFW 31,589,832
5 JFK 27,717,503
6 DEN 26,280,043
7 SFO 24,190,549
8 CLT 21,913,156
9 LAS 21,824,231
10 PHX 21,351,445
11 MIA 20,986,341
12 IAH 20,595,874
13 SEA 20,148,980
14 MCO 18,759,938
15 EWR 18,684,765
16 MSP 17,634,252
17 BOS 16,290,323
18 DTW 16,255,507
19 PHL 15,101,318
20 LGA 14,319,924
21 FLL 13,061,607
22 BWI 11,738,828
23 DCA 11,242,375
24 MDW 10,830,783
25 SLC 10,634,519
26 IAD 10,363,918
27 SAN 9,985,739
28 HNL 9,479,083
29 TPA 9,150,414
30 PDX 8,340,234

 


View the Twitter mentions of major airport on February 24, 2015

#Remove unwanted words from "drop_words" vector  
mention_2015 <- TWT_2015 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)

#Convert the twitter words to all upper case
mention_2015$word <- toupper(mention_2015$word)
  
#Create an Aggregate Table for Mentions by airport code
agg_mention_2015 <- mention_2015 %>%
  select(word) %>%
  mutate(count=1) %>% 
  filter(word %in% hub_df$code) %>% 
  group_by(word) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(count))

#View All Major Airport with Twitter Mentions
pander(agg_mention_2015 %>% 
       select(`Airport`= word, `Twitter Mentions` = count) %>% 
       arrange(Airport),
       justify=c('left','center'),
       caption="Airports with Twitter Mentions on February 24, 2015")
Airports with Twitter Mentions on February 24, 2015
Airport Twitter Mentions
ATL 6
BOS 2
BWI 3
CLT 7
DCA 11
DFW 37
EWR 4
FLL 7
IAD 5
IAH 4
JFK 7
LAS 3
LAX 15
LGA 2
MCO 3
MIA 5
MSP 1
ORD 10
PDX 3
PHL 9
PHX 3
SAN 4
SFO 3
TPA 2
#Plot the Top 10 Major Airport with Twitter Mentions
ggplot(
  (agg_mention_2015 %>% 
  select(word,count)) %>% top_n(10),  
  aes(x=reorder(word,count), y= count)) + 
  geom_bar(stat="identity", fill="darkorange3") + 
  xlab("Airport") +
  ylab("Number of Twitter Mentions") + coord_flip() +
  theme_few() + ggtitle("Top 10 Major Airports with Twitter Mentions on February 24, 2015")

 


View the Twitter mentions of major airport on November 24, 2016

#Remove unwanted words from "drop_words" vector  
mention_2016 <- TWT_2016 %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)

#Convert the twitter words to all upper case
mention_2016$word <- toupper(mention_2016$word)
  
#Create an Aggregate Table for Mentions by airport code
agg_mention_2016 <- mention_2016 %>%
  select(word) %>%
  mutate(count=1) %>% 
  filter(word %in% hub_df$code) %>% 
  group_by(word) %>%
  summarise_each(funs(sum)) %>% 
  arrange(desc(count))

#View All Major Airport with Twitter Mentions
pander(agg_mention_2016 %>% 
       select(`Airport`= word, `Twitter Mentions` = count) %>% 
       arrange(Airport),
       justify=c('left','center'),
       caption="Airports with Twitter Mentions on November 24, 2016")
Airports with Twitter Mentions on November 24, 2016
Airport Twitter Mentions
ATL 18
BOS 5
BWI 12
CLT 6
DCA 38
DEN 7
DFW 21
DTW 6
EWR 16
FLL 1
HNL 2
IAD 9
IAH 2
JFK 18
LAS 13
LAX 28
LGA 11
MCO 5
MDW 7
MIA 5
MSP 6
ORD 27
PDX 4
PHL 8
PHX 5
SAN 27
SEA 4
SFO 22
SLC 4
TPA 1
#Plot the Top 10 Major Airports with Twitter Mentions
ggplot(
  (agg_mention_2016 %>%
  select(word,count)) %>% top_n(10), 
  aes(x=reorder(word,count), y= count)) + 
  geom_bar(stat="identity", fill="steelblue4") + 
  xlab("Airport") +
  ylab("Number of Twitter Mentions") + coord_flip() +
  theme_few() + ggtitle("Top 10 Major Airport with Twitter Mentions on November 24, 2016")

 


Compare Social Media Airport Mentions with Origin Delays for February 24, 2015

#Create an Aggregate Table for Mentions by airport code
AGG_MAJOR_FEB2015 <- AGG_ORIG_FEB2015 %>%
  select("word"=Origin,DepDel15) %>%
  filter(word %in% hub_df$code) %>% 
  arrange(desc(DepDel15))

#Join Social Media Data with Flight Data
AGG_MAJOR_FEB2015 <- inner_join(AGG_MAJOR_FEB2015,agg_mention_2015,by="word")

#Rename the columns
AGG_MAJOR_FEB2015 <- AGG_MAJOR_FEB2015 %>% 
  select(`Origin`=word,`Delays`=DepDel15,`Mentions`=count)

#Plot the Combined USAir and American Delay Data
AGG_MAJOR_FEB2015 %>% 
  ggvis(~`Delays`, ~`Mentions`) %>% 
  layer_points(fill = ~factor(Origin), size := 600, opacity := .8) %>% 
  add_axis("y", title = "Number of Twitter Mentions Recorded on FEB 24, 2015") %>% 
  add_axis("x", title = "Number of Delayed Departures Recorded on FEB 24, 2015")

No apparent correlation between delays and twitter mentions

It is difficult to discern the difference in colors but the outliers are ATL for delayed departures and DFW for Twitter mentions.

 


Compare Social Media Airport Mentions with Origin Delay Data for November 24, 2016

#Create an Aggregate Table for Mentions by airport code
AGG_MAJOR_NOV2016 <- AGG_ORIG_NOV2015 %>%
  select("word"=Origin,DepDel15) %>%
  filter(word %in% hub_df$code) %>% 
  arrange(desc(DepDel15))

#Join Social Media Data with Flight Data
AGG_MAJOR_NOV2016 <- inner_join(AGG_MAJOR_NOV2016,agg_mention_2016,by="word")

#Rename the columns
AGG_MAJOR_NOV2016 <- AGG_MAJOR_NOV2016 %>% 
  select(`Origin`=word,`Delays`=DepDel15,`Mentions`=count)

#Plot the Combined USAir and American Delay Data
AGG_MAJOR_NOV2016 %>% 
  ggvis(~`Delays`, ~`Mentions`) %>% 
  layer_points(fill = ~factor(Origin), size := 600, opacity := .8) %>% 
  add_axis("y", title = "Number of Twitter Mentions Recorded on NOV 24, 2016") %>% 
  add_axis("x", title = "Number of Delayed Departures Recorded on NOV 24, 2016")

This chart will be updated once the November 2016 data becomes available

 


Prepare Airport Location Data and Merge with Scraped Data

#Select the useful fields from the imported data
location_df <- location_df %>%
  select(code = AIRPORT,lat = LATITUDE,lon = LONGITUDE)

#Merge the imported data with the scraped data
geo_df <- merge(hub_df,  location_df, by = intersect("code", "code"))

#There were duplicate rows in airport_location.csv
#with different lat & lon values.
#Need to remove duplicate rows
geo_df <- geo_df[!duplicated(geo_df$code), ]

The data is now ready to be placed on a map


Location of Top 30 Busiest Airports in the USA for 2015

#Create popup meta data
hub_popup <- paste(
                    "<br><strong>Airport Name:</strong>",geo_df$name,
                    "<br><strong>Three Letter Code: </strong>",geo_df$code,
                    "<br><strong>City of Service: </strong>",geo_df$city,
                    "<br><strong>State of Service: </strong>",geo_df$state,
                    "<br><strong>Total Boardings in 2015: </strong>",geo_df$boardings,
                    "<br><strong>Rank by Total Boardings in 2015: </strong>",geo_df$rank,
                    "<br><strong>February 24, 2015 Departure Delays: </strong>",AGG_ORIG_FEB2015$DepDel15,
                    "<br><strong>February 24, 2015 Twitter Mentions: </strong>",agg_mention_2015$count,
                    "<br><strong>November 24, 2016 Departure Delays: </strong>",AGG_ORIG_NOV2015$DepDel15,
                    "<br><strong>November 24, 2016 Twitter Mentions: </strong>",agg_mention_2016$count
                   )

#Create a Map and Insert Markers from ptld_df
leaflet(data = geo_df) %>% addTiles() %>%
  setView(lng = -98.58, lat = 39.82, zoom = 4) %>%
  #Insert Third Party Tiles
  addProviderTiles("Esri.DeLorme") %>% 
  #Add Markers and Pop-up Data
  addMarkers(~lon, ~lat, popup = ~hub_popup)

Click on one of the markers to view airport data and twitter mentions

Note: Honolulu (HNL) is ranked in the top 30 busiest airports but does not show in the initial map.

 

Thank you for your time!