This intent of this report to compare metrics for domestic flights originating in the United States on February 24, 2015 with similar flights originating on November 23, 2016. The fight data was obtained from the United States Department of Transportation Website. Unfortunately, the November 2016 flight data will not be available until January 2017 so the assumption was made that flight data for the day before the Thanksgiving holiday in 2015 could be substituted for the 2016 flight data until the November 2016 data becomes available.
The flight data will be compared to social media activity on Twitter occurring on February 24, 2015 and November 23, 2016 so the correlation between the November 2015 flight data and the November 2016 social media data is for demonstration purposes only until the November 2016 flight data becomes available.
On December 9, 2013, American Airlines and US Airways merged to form American Airlines Group. Each Airline continues to operate independently through a planned integration period. A data set containing social media activity regarding domestic air travel in the United States on February 24,2015 was Obtained from Kaggle and is used to conduct a sentiment analysis of airline service during planned integration time period. The assumption was made that the combination of American Airlines and Us Airways data can be directly compared to the post merge data.
On October 17, 2015, the American and US Airways merger was finalized and the integration process was considered complete. This supports the assumption of being able to use the November 2015 flight data in place of the November 2016 flight data because American Airlines and US Airways were operating as one single airline under the American Airlines name.
To determine of the busiest airports in the United States in 2015, Wikipedia was sourced to scrape the List of the busiest airports in the United States. The Selector Gadget extension for Google Chrome was utilized to select the CSS data from the Wikipedia page.
To obtain latitude and longitude information, all airport codes and their location were downloaded from United States Department of Transportation
The most recent twitter data was obtained on November 24, 2016 and saved in .csv format to preserve the data. The code used to download the social media data from Twitter can be found in a separate report published to RPubs here: airline_tweets.html.
Note: If you have trouble with any of the links, try right clicking and selecting “open in a new window (or tab)”. You may also need to download the “airline_tweets.html” file and open it with a browser to view it properly.
Note: The link to the data works fine in the knit .html version but does not open in RPubs.
To view the data in RPubs: right-click the link and select “open link in a new tab” or download the files individually and open them from your PC with the appropriate software.
#Load the Raw Flight Data from February 2015
RAW_FEB2015 <- read.csv("Flights_FEB2015.csv", header = TRUE, stringsAsFactors = FALSE)
#Load the Raw Flight Data from November 2015
RAW_NOV2015 <- read.csv("Flights_NOV2015.csv", header = TRUE, stringsAsFactors = FALSE)
#Load the raw data from United States Department of Transportation
location_df <- read_csv("airport_location.csv")
#Extract records and Add a counter field "numflight"
FLT_FEB2015 <- RAW_FEB2015 %>%
select(FlightDate,Carrier,Origin,Dest,DepDel15) %>%
filter(FlightDate == "2015-02-24")
#Add airline names for carriers targeted for study
FLT_FEB2015 <- FLT_FEB2015 %>% mutate(Airline = NA) %>%
mutate(Airline = ifelse(Carrier == "AA", "American", Airline)) %>%
mutate(Airline = ifelse(Carrier == "DL", "Delta", Airline)) %>%
mutate(Airline = ifelse(Carrier == "UA", "United", Airline)) %>%
mutate(Airline = ifelse(Carrier == "US", "USAir", Airline)) %>%
mutate(Airline = ifelse(Carrier == "WN", "Southwest", Airline))
#Isolate airline names for carriers targeted for study
FLT_FEB2015 <- FLT_FEB2015 %>%
filter(Airline != "NA")
#Isolate flights with departure delays 15 minutes or more
DEL_FEB2015 <- FLT_FEB2015 %>%
select(Airline,Origin,DepDel15) %>%
filter(DepDel15 > 0)
#Create an Aggregate Table for Delayed Flights by Airline
AGG_ARLN_FEB2015 <- DEL_FEB2015 %>%
select(Airline,DepDel15) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(DepDel15))
#View the Combined USAir and American Delay Data
pander(AGG_ARLN_FEB2015 %>%
select(Airline, `Delayed Departures` = DepDel15) %>%
arrange(Airline),
justify=c('left','center'),
caption="Airline Delays on February 24, 2015")
| Airline | Delayed Departures |
|---|---|
| American | 275 |
| Delta | 1011 |
| Southwest | 573 |
| United | 251 |
| USAir | 266 |
#Plot the Combined USAir and American Delay Data
ggplot(
(AGG_ARLN_FEB2015 %>%
select(Airline,DepDel15)),
aes(x=reorder(Airline,DepDel15), y= DepDel15)) +
geom_bar(stat="identity", fill="darkorange3") +
xlab("Airline") +
ylab("Delayed Departures") + coord_flip() +
theme_few() + ggtitle("Major Airlines with Delays on February 24, 2015")
#Create an Aggregate Table for Delayed Flights by Origin
#This table will be used later for popup data on map markers
AGG_ORIG_FEB2015 <- DEL_FEB2015 %>%
select(Origin,DepDel15) %>%
group_by(Origin) %>%
summarise_each(funs(sum)) %>%
arrange(desc(DepDel15))
#Plot the Top 10 Airports with Departure Delays
ggplot(
(AGG_ORIG_FEB2015 %>%
select(Origin,DepDel15)) %>% top_n(10),
aes(x=reorder(Origin,DepDel15), y= DepDel15)) +
geom_bar(stat="identity", fill="darkorange3") +
xlab("Originating Airport") +
ylab("Delayed Departures") + coord_flip() +
theme_few() + ggtitle("Top 10 Airports with Departure Delays on February 24, 2015")
Atlanta (ATL) is a major hub for Delta Airlines.
#Extract records and Add a counter field "numflight"
FLT_NOV2015 <- RAW_NOV2015 %>%
select(FlightDate,Carrier,Origin,Dest,Cancelled,DepDel15) %>%
filter(FlightDate == "2015-11-25")
#Add airline names for carriers targeted for study
FLT_NOV2015 <- FLT_NOV2015 %>% mutate(Airline = NA) %>%
mutate(Airline = ifelse(Carrier == "AA", "American", Airline)) %>%
mutate(Airline = ifelse(Carrier == "DL", "Delta", Airline)) %>%
mutate(Airline = ifelse(Carrier == "UA", "United", Airline)) %>%
mutate(Airline = ifelse(Carrier == "US", "USAir", Airline)) %>%
mutate(Airline = ifelse(Carrier == "WN", "Southwest", Airline))
#Isolate airline names for carriers targeted for study
FLT_NOV2015 <- FLT_NOV2015 %>%
filter(Airline != "NA")
#Isolate flights with departure delays 15 minutes or more
DEL_NOV2015 <- FLT_NOV2015 %>%
select(Airline,Origin,Dest,DepDel15) %>%
filter(DepDel15 > 0)
#Create an Aggregate Table for Delayed Flights by Airline
AGG_ARLN_NOV2015 <- DEL_NOV2015 %>%
select(Airline,DepDel15) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(DepDel15))
#View the Combined USAir and American Delay Data
pander(AGG_ARLN_NOV2015 %>%
select(Airline, `Delayed Departures` = DepDel15) %>%
arrange(Airline),
justify=c('left','center'),
caption="Airline Delays on November 25, 2015")
| Airline | Delayed Departures |
|---|---|
| American | 155 |
| Delta | 166 |
| Southwest | 312 |
| United | 141 |
#Plot the Combined USAir and American Delay Data
ggplot(
(AGG_ARLN_NOV2015 %>%
select(Airline,DepDel15)),
aes(x=reorder(Airline,DepDel15), y= DepDel15)) +
geom_bar(stat="identity", fill="steelblue4") +
xlab("Airline") +
ylab("Delayed Departures") + coord_flip() +
theme_few() + ggtitle("Major Airlines with Delays on November 25, 2015")
#Create an Aggregate Table for Delayed Flights by Origin
#This table will be used later for popup data on map markers
AGG_ORIG_NOV2015 <- DEL_NOV2015 %>%
select(Origin,DepDel15) %>%
group_by(Origin) %>%
summarise_each(funs(sum)) %>%
arrange(desc(DepDel15))
#Plot the Top 10 Airports with Departure Delays
ggplot(
(AGG_ORIG_NOV2015 %>%
select(Origin,DepDel15)) %>% top_n(10),
aes(x=reorder(Origin,DepDel15), y= DepDel15)) +
geom_bar(stat="identity", fill="steelblue4") +
xlab("Originating Airport") +
ylab("Delayed Departures") + coord_flip() +
theme_few() + ggtitle("Top 10 Airports with Departure Delays on November 25, 2015")
Los Angles (LAX) is the second busiest airport in the United States.
#Combine USAir and American Delay Data
AGG_AA_FEB2015 <- AGG_ARLN_FEB2015 %>%
mutate(Airline = ifelse(Airline == "US Airways","American",Airline))
#Aggregate the Combined USAir and American Delay Data
AGG_AA_FEB2015 <- AGG_AA_FEB2015 %>%
select(Airline,DepDel15) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(DepDel15))
AGG_ARLN_DEL <- inner_join(
AGG_AA_FEB2015 %>% select(Airline,`Delays FEB 24, 2015`=DepDel15),
AGG_ARLN_NOV2015 %>% select(Airline,`Delays NOV 24, 2015`=DepDel15),
by="Airline") %>%
arrange(Airline)
#View the Combined USAir and American Delay Data
pander(AGG_ARLN_DEL,
justify=c('left','center','center'),
caption="Airline Delays with American and USAir Merged")
| Airline | Delays FEB 24, 2015 | Delays NOV 24, 2015 |
|---|---|---|
| American | 275 | 155 |
| Delta | 1011 | 166 |
| Southwest | 573 | 312 |
| United | 251 | 141 |
#Plot the Combined USAir and American Delay Data
AGG_ARLN_DEL %>%
ggvis(~`Delays FEB 24, 2015`, ~`Delays NOV 24, 2015`) %>%
layer_points(fill = ~factor(Airline), size := 600, opacity := .8) %>%
add_axis("x", title = "Number of Delayed Departures Recorded on FEB 24, 2015") %>%
add_axis("y", title = "Number of Delayed Departures Recorded on NOV 24, 2015")
American Airlines seems to have improved with the merger.
#Load the raw twitter data downloaded from kaggle
T_RAW2015 <- read_csv("Airline_Tweets_24FEB2015.csv")
#Load the preserved twitter data from November 24, 2016
T_RAW2016 <- read_csv("Airline_Tweets_24NOV2016.csv")
#Create "bing" word dictionary for sentiment analysis
lex_bing <- sentiments %>% filter(lexicon == "bing")
#Create "finn" word dictionary for sentiment analysis
lex_finn <- sentiments %>% filter(lexicon == "AFINN")
#Create a list of words to drop
#This list targets "non-value-added" words
drop_words <- c("americanair",
"american",
"united",
"usairways",
"usair",
"southwestair",
"delta",
"jetblue",
"thanksgiving",
"flight","fights",
"airport","airports",
"airline","airlines",
"plane","planes",
"fly","flying",
"http","https",
"lol","thx","hey",
"hrs","air","fan",
"skies",
"wifi")
The “drop_words” list is subjective and can be tweaked as needed
#Select columns to match recently obtained data
TWT_2015 <- T_RAW2015 %>% select(airline,tweet_created,text,name) %>%
#grep only tweets that occurred on "2015-02-24"
filter(grepl('2015-02-24', tweet_created)) %>%
filter(grepl('United|US Airways|Delta|Southwest|American', airline))
#Crate a regex pattern
reg <- "([^A-Za-z\\d']|'(?![A-Za-z\\d]))|[0-9]"
#Create a new data frame of key words
tweets_2015 <- TWT_2015 %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
#Remove stop words and all two letter words
filter(!word %in% stop_words$word,
str_detect(word,"[a-z]{3,20}"))
#Remove unwanted words from "drop_words" vector
tweets_2015 <- tweets_2015 %>% filter(!word %in% drop_words)
#Create a new aggregate data frame of key words
agg_words_2015 <- tweets_2015 %>%
group_by(word) %>%
summarize(n = n()) %>%
arrange(desc(n))
#Join extracted key words with sentiment dictionaries
words_2015_bing <- tweets_2015 %>% inner_join(lex_bing, by="word")
words_2015_finn <- tweets_2015 %>% inner_join(lex_finn, by="word")
words_2015_bing %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("darkred", "darkblue"),
max.words = 100)
The word “delayed” stands out and that is bad for sentiment
words_2015_bing %>%
inner_join(lex_bing) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
filter(n > 10) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Contribution to sentiment") +
coord_flip() + theme_few() +
ggtitle("Frequently Used Words on February 24, 2015")
“Delayed” is clearly the dominant word tweeted
#Select columns to match recently obtained data
TWT_2016 <- T_RAW2016 %>% select(airline,tweet_created,text,name) %>%
#grep only tweets that occurred on "2016-02-24"
filter(grepl('2016-11-24', tweet_created))
#Create a new data frame
tweets_2016 <- TWT_2016 %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
#Remove stop words and all two letter words
filter(!word %in% stop_words$word,
str_detect(word,"[a-z]{3,20}"))
#Remove unwanted words from "drop_words" vector
tweets_2016 <- tweets_2016 %>% filter(!word %in% drop_words)
#Create a new aggregate data frame of key words
agg_words_2016 <- tweets_2016 %>%
group_by(word) %>%
summarize(n = n()) %>%
arrange(desc(n))
#Join extracted key words with sentiment dictionaries
words_2016_bing <- tweets_2016 %>% inner_join(lex_bing, by="word")
words_2016_finn <- tweets_2016 %>% inner_join(lex_finn, by="word")
words_2016_bing %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("darkred", "darkgreen"),
max.words = 100)
The word “happy” stands out and that is good for sentiment
words_2016_bing %>%
inner_join(lex_bing) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
filter(n > 40) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Contribution to sentiment") +
coord_flip() + theme_few() +
ggtitle("Frequently Used Words on November 24, 2016")
Positive sentiment seems to greatly outweigh the negative sentiment
#Create an aggregate data frame of finn lexicon score
agg_air_2015 <- words_2015_finn %>%
select(`Airline`=airline,score) %>%
group_by(Airline) %>%
summarize_each(funs(sum)) %>%
arrange(desc(score))
#View the finn lexicon score for each airline
pander(agg_air_2015 %>%
select(Airline, `Sentiment Score` = score) %>%
arrange(Airline),
justify=c('left','center'),
caption="Finn Lexicon Score on February 24, 2015")
| Airline | Sentiment Score |
|---|---|
| American | -200 |
| Delta | -16 |
| Southwest | -58 |
| United | -50 |
| US Airways | -88 |
Strong negative sentiment with a cumulative negative sum.
ggplot(
(agg_air_2015 %>%
select(Airline,score)),
aes(x=reorder(Airline,score), y= score)) +
geom_bar(stat="identity", fill="darkorange3") +
xlab("Airline") +
ylab("Finn Lexicon Sentiment Score") + coord_flip() +
theme_few() + ggtitle("Major Airlines Sentiment Score on February 24, 2015")
Overall sentiment was negative and American Airlines received the most negative tweets
Clearly if we combined US Airways and American, they would dwarf the others
#Create an aggregate data frame of finn lexicon score
agg_air_2016 <- words_2016_finn %>%
select(`Airline`=airline,score) %>%
group_by(Airline) %>%
summarize_each(funs(sum)) %>%
arrange(desc(score))
#View the finn lexicon score for each airline
pander(agg_air_2016 %>%
select(Airline, `Sentiment Score` = score) %>%
arrange(Airline),
justify=c('left','center'),
caption="Finn Lexicon Score on November 24, 2016")
| Airline | Sentiment Score |
|---|---|
| American | 307 |
| Delta | 272 |
| Southwest | 967 |
| United | 64 |
Strong negative sentiment with aggregate scores all less than zero.
ggplot(
(agg_air_2016 %>%
select(Airline,score)),
aes(x=reorder(Airline,score), y= score)) +
geom_bar(stat="identity", fill="steelblue4") +
xlab("Airline") +
ylab("Finn Lexicon Sentiment Score") + coord_flip() +
theme_few() + ggtitle("Major Airlines Sentiment Score on November 24, 2016")
It is interesting that the overall sentiment was positive.
#Create data frame for delay percentages
p.airline_feb2015 <- FLT_FEB2015 %>%
select(Airline, DepDel15) %>%
filter(DepDel15 != "NA")
#Create aggregate data frame for delay percentages
p.airline_feb2015 <- p.airline_feb2015 %>%
select(Airline,DepDel15) %>%
mutate(count=1) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(Airline)
#View aggregate data before calculating percentage
pander(p.airline_feb2015 %>% select(Airline,`Delays`=DepDel15,`Flights`=count),
justify=c('left','center','center'),
caption="Values used to calculate delay percentages")
| Airline | Delays | Flights |
|---|---|---|
| American | 275 | 1371 |
| Delta | 1011 | 2178 |
| Southwest | 573 | 3249 |
| United | 251 | 1331 |
| USAir | 266 | 1105 |
#Calculate the delay percentage
p.airline_feb2015 <- p.airline_feb2015 %>%
mutate(DepDel15 = round(100*(DepDel15/count)))
#Rename DepDel15 column
p.airline_feb2015 <- p.airline_feb2015 %>%
select(Airline,`Delay`=DepDel15)
#Create an aggregate data frame of positive bing lexicon
agg_bing_pos_2015 <- words_2015_bing %>%
select("Airline" = airline,sentiment) %>%
filter(grepl('positive',sentiment))
#Add a counter and summarize
agg_bing_pos_2015 <- agg_bing_pos_2015 %>%
select(Airline) %>%
mutate(Positive=1) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(Positive))
#Create an aggregate data frame of negative bing lexicon
agg_bing_neg_2015 <- words_2015_bing %>%
select("Airline" = airline,sentiment) %>%
filter(grepl('negative',sentiment))
#Add a counter and summarize
agg_bing_neg_2015 <- agg_bing_neg_2015 %>%
select(Airline) %>%
mutate(Negative=1) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(Negative))
#Join the bing lexicon tables
agg_bing_2015 <- inner_join(agg_bing_pos_2015,agg_bing_neg_2015,by="Airline") %>%
arrange(Airline)
#Calculated the Percentage of Positive vs. Negative words
agg_bing_2015 <- agg_bing_2015 %>%
select(Airline,Positive,Negative) %>%
mutate(perPositve = round(100 *(Positive/(Positive + Negative)))) %>%
mutate(perNegative = round(100 *(Negative/(Positive + Negative))))
#Join the bing lexicon tables
agg_feb2015 <- inner_join(agg_bing_2015,p.airline_feb2015,by="Airline") %>%
arrange(desc(Airline))
agg_percent_feb2015 <- agg_feb2015 %>%
select(Airline,Delay,`Positive`=perPositve,`Negative`=perNegative) %>%
arrange(desc(Airline))
#Melt the Comparison Variables
agg_percent_feb2015.m <- melt(agg_percent_feb2015, id.vars='Airline')
ggplot(
agg_percent_feb2015.m,
aes(x=Airline, y=value, fill=variable)) +
geom_bar(stat="identity", position=position_dodge()) +
xlab("Airline") +
ylab("Variable Percentage") + coord_flip() +
theme_few() + ggtitle("Comparing Delay Percentage to Sentiment for February 24, 2015")
Sentiment does not appear to be directly correlated to delays
The following table shows the numbers used to calculate the delay percentages.
#View the Data Scraped from Wikipedia
pander(agg_bing_2015 %>% select(Airline,Positive,Negative),
justify=c('left','center','center'),
caption="Distribution of bing sentiment for Febrary 24, 2015")
| Airline | Positive | Negative |
|---|---|---|
| American | 183 | 338 |
| Delta | 28 | 41 |
| Southwest | 59 | 99 |
| United | 64 | 125 |
| US Airways | 50 | 116 |
#Create data frame for delay percentages
p.airline_nov2015 <- FLT_NOV2015 %>%
select(Airline, DepDel15) %>%
filter(DepDel15 != "NA")
#Create aggregate data frame for delay percentages
p.airline_nov2015 <- p.airline_nov2015 %>%
select(Airline,DepDel15) %>%
mutate(count=1) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(Airline)
#View aggregate data before calculating percentage
pander(p.airline_nov2015 %>% select(Airline,`Delays`=DepDel15,`Flights`=count),
justify=c('left','center','center'),
caption="Values used to calculate delay percentages")
| Airline | Delays | Flights |
|---|---|---|
| American | 155 | 2594 |
| Delta | 166 | 2592 |
| Southwest | 312 | 3799 |
| United | 141 | 1536 |
#Calculate the delay percentage
p.airline_nov2015 <- p.airline_nov2015 %>%
mutate(DepDel15 = round(100*(DepDel15/count)))
#Rename DepDel15 column
p.airline_nov2015 <- p.airline_nov2015 %>%
select(Airline,`Delay`=DepDel15)
#Create an aggregate data frame of positive bing lexicon
agg_bing_pos_2016 <- words_2016_bing %>%
select("Airline" = airline,sentiment) %>%
filter(grepl('positive',sentiment))
#Add a counter and summarize
agg_bing_pos_2016 <- agg_bing_pos_2016 %>%
select(Airline) %>%
mutate(Positive=1) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(Positive))
#Create an aggregate data frame of negative bing lexicon
agg_bing_neg_2016 <- words_2016_bing %>%
select("Airline" = airline,sentiment) %>%
filter(grepl('negative',sentiment))
#Add a counter and summarize
agg_bing_neg_2016 <- agg_bing_neg_2016 %>%
select(Airline) %>%
mutate(Negative=1) %>%
group_by(Airline) %>%
summarise_each(funs(sum)) %>%
arrange(desc(Negative))
#Join the bing lexicon tables
agg_bing_2016 <- inner_join(agg_bing_pos_2016,agg_bing_neg_2016,by="Airline") %>%
arrange(Airline)
#Calculated the Percentage of Positive vs. Negative words
agg_bing_2016 <- agg_bing_2016 %>%
select(Airline,Positive,Negative) %>%
mutate(perPositve = round(100 *(Positive/(Positive + Negative)))) %>%
mutate(perNegative = round(100 *(Negative/(Positive + Negative))))
#Join the bing lexicon tables
agg_nov2016 <- inner_join(agg_bing_2016,p.airline_nov2015,by="Airline") %>%
arrange(desc(Airline))
agg_percent_nov2016 <- agg_nov2016 %>%
select(Airline,Delay,`Positive`=perPositve,`Negative`=perNegative) %>%
arrange(desc(Airline))
#Melt the Comparison Variables
agg_percent_nov2016.m <- melt(agg_percent_nov2016, id.vars='Airline')
ggplot(
agg_percent_nov2016.m,
aes(x=Airline, y=value, fill=variable)) +
geom_bar(stat="identity", position=position_dodge()) +
xlab("Airline") +
ylab("Variable Percentage") + coord_flip() +
theme_few() + ggtitle("Comparing Delay Percentage to Sentiment for November 24, 2016")
This chart will be updated once the November 2016 data becomes available
It is difficult to determine if the overall positive sentiment is due to the lack of delayed departures or simply excitement for Thanksgiving holiday.
The perfect split between positive and negative for Delta is purely coincidental
The actual numbers were slightly different but once rounded as a percentage they become equal. The following table shows the numbers used to calculate the percentages.
#View the Data Scraped from Wikipedia
pander(agg_bing_2016 %>% select(Airline,Positive,Negative),
justify=c('left','center','center'),
caption="Distribution of bing sentiment for November 24, 2016")
| Airline | Positive | Negative |
|---|---|---|
| American | 436 | 465 |
| Delta | 447 | 449 |
| Southwest | 642 | 272 |
| United | 288 | 379 |
#Read HTML Data
airports <- read_html("https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States")
rank <- airports %>% html_nodes("td:nth-child(1)") %>% html_text()
name <- airports %>% html_nodes("td:nth-child(2)") %>% html_text()
code <- airports %>% html_nodes("td:nth-child(3)") %>% html_text()
city <- airports %>% html_nodes("td:nth-child(4)") %>% html_text()
state <- airports %>% html_nodes("td:nth-child(5)") %>% html_text()
boardings <- airports %>% html_nodes("td:nth-child(6)") %>% html_text()
After viewing the scraped data with list(), it became obvious the program selected a few extra lines of data.
A new data frame will be created using only the first 30 rows of each vector.
#Create new data frame from the first 30 rows in each vector
#rename column headings
hub_df <- data.frame(cbind(rank=rank[1:30],
name=name[1:30],
code=code[1:30],
city=city[1:30],
state=state[1:30],
boardings=boardings[1:30]))
#Need to remove "/" characters for HTML rendering
#Wasn't having much luck so skipped city and name.
#hub_df <- hub_df %>%
#mutate(city = gsub("[/]", " ", city)) %>%
#mutate(city = gsub("[-]", " ", city)) %>%
#mutate(name = gsub("[/]", " ", name)) %>%
#mutate(name = gsub("[-]", " ", name))
#View the Data Scraped from Wikipedia
pander(hub_df %>% select(`Rank`=rank,
`Airport`=code,
`Boardings`=boardings),
justify=c('center','center','right'),
caption="Top 30 Busiest Airport in the United States for 2015")
| Rank | Airport | Boardings |
|---|---|---|
| 1 | ATL | 49,340,732 |
| 2 | LAX | 36,351,226 |
| 3 | ORD | 36,305,668 |
| 4 | DFW | 31,589,832 |
| 5 | JFK | 27,717,503 |
| 6 | DEN | 26,280,043 |
| 7 | SFO | 24,190,549 |
| 8 | CLT | 21,913,156 |
| 9 | LAS | 21,824,231 |
| 10 | PHX | 21,351,445 |
| 11 | MIA | 20,986,341 |
| 12 | IAH | 20,595,874 |
| 13 | SEA | 20,148,980 |
| 14 | MCO | 18,759,938 |
| 15 | EWR | 18,684,765 |
| 16 | MSP | 17,634,252 |
| 17 | BOS | 16,290,323 |
| 18 | DTW | 16,255,507 |
| 19 | PHL | 15,101,318 |
| 20 | LGA | 14,319,924 |
| 21 | FLL | 13,061,607 |
| 22 | BWI | 11,738,828 |
| 23 | DCA | 11,242,375 |
| 24 | MDW | 10,830,783 |
| 25 | SLC | 10,634,519 |
| 26 | IAD | 10,363,918 |
| 27 | SAN | 9,985,739 |
| 28 | HNL | 9,479,083 |
| 29 | TPA | 9,150,414 |
| 30 | PDX | 8,340,234 |
#Remove unwanted words from "drop_words" vector
mention_2015 <- TWT_2015 %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg)
#Convert the twitter words to all upper case
mention_2015$word <- toupper(mention_2015$word)
#Create an Aggregate Table for Mentions by airport code
agg_mention_2015 <- mention_2015 %>%
select(word) %>%
mutate(count=1) %>%
filter(word %in% hub_df$code) %>%
group_by(word) %>%
summarise_each(funs(sum)) %>%
arrange(desc(count))
#View All Major Airport with Twitter Mentions
pander(agg_mention_2015 %>%
select(`Airport`= word, `Twitter Mentions` = count) %>%
arrange(Airport),
justify=c('left','center'),
caption="Airports with Twitter Mentions on February 24, 2015")
| Airport | Twitter Mentions |
|---|---|
| ATL | 6 |
| BOS | 2 |
| BWI | 3 |
| CLT | 7 |
| DCA | 11 |
| DFW | 37 |
| EWR | 4 |
| FLL | 7 |
| IAD | 5 |
| IAH | 4 |
| JFK | 7 |
| LAS | 3 |
| LAX | 15 |
| LGA | 2 |
| MCO | 3 |
| MIA | 5 |
| MSP | 1 |
| ORD | 10 |
| PDX | 3 |
| PHL | 9 |
| PHX | 3 |
| SAN | 4 |
| SFO | 3 |
| TPA | 2 |
#Plot the Top 10 Major Airport with Twitter Mentions
ggplot(
(agg_mention_2015 %>%
select(word,count)) %>% top_n(10),
aes(x=reorder(word,count), y= count)) +
geom_bar(stat="identity", fill="darkorange3") +
xlab("Airport") +
ylab("Number of Twitter Mentions") + coord_flip() +
theme_few() + ggtitle("Top 10 Major Airports with Twitter Mentions on February 24, 2015")
#Remove unwanted words from "drop_words" vector
mention_2016 <- TWT_2016 %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg)
#Convert the twitter words to all upper case
mention_2016$word <- toupper(mention_2016$word)
#Create an Aggregate Table for Mentions by airport code
agg_mention_2016 <- mention_2016 %>%
select(word) %>%
mutate(count=1) %>%
filter(word %in% hub_df$code) %>%
group_by(word) %>%
summarise_each(funs(sum)) %>%
arrange(desc(count))
#View All Major Airport with Twitter Mentions
pander(agg_mention_2016 %>%
select(`Airport`= word, `Twitter Mentions` = count) %>%
arrange(Airport),
justify=c('left','center'),
caption="Airports with Twitter Mentions on November 24, 2016")
| Airport | Twitter Mentions |
|---|---|
| ATL | 18 |
| BOS | 5 |
| BWI | 12 |
| CLT | 6 |
| DCA | 38 |
| DEN | 7 |
| DFW | 21 |
| DTW | 6 |
| EWR | 16 |
| FLL | 1 |
| HNL | 2 |
| IAD | 9 |
| IAH | 2 |
| JFK | 18 |
| LAS | 13 |
| LAX | 28 |
| LGA | 11 |
| MCO | 5 |
| MDW | 7 |
| MIA | 5 |
| MSP | 6 |
| ORD | 27 |
| PDX | 4 |
| PHL | 8 |
| PHX | 5 |
| SAN | 27 |
| SEA | 4 |
| SFO | 22 |
| SLC | 4 |
| TPA | 1 |
#Plot the Top 10 Major Airports with Twitter Mentions
ggplot(
(agg_mention_2016 %>%
select(word,count)) %>% top_n(10),
aes(x=reorder(word,count), y= count)) +
geom_bar(stat="identity", fill="steelblue4") +
xlab("Airport") +
ylab("Number of Twitter Mentions") + coord_flip() +
theme_few() + ggtitle("Top 10 Major Airport with Twitter Mentions on November 24, 2016")
#Create an Aggregate Table for Mentions by airport code
AGG_MAJOR_FEB2015 <- AGG_ORIG_FEB2015 %>%
select("word"=Origin,DepDel15) %>%
filter(word %in% hub_df$code) %>%
arrange(desc(DepDel15))
#Join Social Media Data with Flight Data
AGG_MAJOR_FEB2015 <- inner_join(AGG_MAJOR_FEB2015,agg_mention_2015,by="word")
#Rename the columns
AGG_MAJOR_FEB2015 <- AGG_MAJOR_FEB2015 %>%
select(`Origin`=word,`Delays`=DepDel15,`Mentions`=count)
#Plot the Combined USAir and American Delay Data
AGG_MAJOR_FEB2015 %>%
ggvis(~`Delays`, ~`Mentions`) %>%
layer_points(fill = ~factor(Origin), size := 600, opacity := .8) %>%
add_axis("y", title = "Number of Twitter Mentions Recorded on FEB 24, 2015") %>%
add_axis("x", title = "Number of Delayed Departures Recorded on FEB 24, 2015")
No apparent correlation between delays and twitter mentions
It is difficult to discern the difference in colors but the outliers are ATL for delayed departures and DFW for Twitter mentions.
#Create an Aggregate Table for Mentions by airport code
AGG_MAJOR_NOV2016 <- AGG_ORIG_NOV2015 %>%
select("word"=Origin,DepDel15) %>%
filter(word %in% hub_df$code) %>%
arrange(desc(DepDel15))
#Join Social Media Data with Flight Data
AGG_MAJOR_NOV2016 <- inner_join(AGG_MAJOR_NOV2016,agg_mention_2016,by="word")
#Rename the columns
AGG_MAJOR_NOV2016 <- AGG_MAJOR_NOV2016 %>%
select(`Origin`=word,`Delays`=DepDel15,`Mentions`=count)
#Plot the Combined USAir and American Delay Data
AGG_MAJOR_NOV2016 %>%
ggvis(~`Delays`, ~`Mentions`) %>%
layer_points(fill = ~factor(Origin), size := 600, opacity := .8) %>%
add_axis("y", title = "Number of Twitter Mentions Recorded on NOV 24, 2016") %>%
add_axis("x", title = "Number of Delayed Departures Recorded on NOV 24, 2016")
This chart will be updated once the November 2016 data becomes available
#Select the useful fields from the imported data
location_df <- location_df %>%
select(code = AIRPORT,lat = LATITUDE,lon = LONGITUDE)
#Merge the imported data with the scraped data
geo_df <- merge(hub_df, location_df, by = intersect("code", "code"))
#There were duplicate rows in airport_location.csv
#with different lat & lon values.
#Need to remove duplicate rows
geo_df <- geo_df[!duplicated(geo_df$code), ]
The data is now ready to be placed on a map
#Create popup meta data
hub_popup <- paste(
"<br><strong>Airport Name:</strong>",geo_df$name,
"<br><strong>Three Letter Code: </strong>",geo_df$code,
"<br><strong>City of Service: </strong>",geo_df$city,
"<br><strong>State of Service: </strong>",geo_df$state,
"<br><strong>Total Boardings in 2015: </strong>",geo_df$boardings,
"<br><strong>Rank by Total Boardings in 2015: </strong>",geo_df$rank,
"<br><strong>February 24, 2015 Departure Delays: </strong>",AGG_ORIG_FEB2015$DepDel15,
"<br><strong>February 24, 2015 Twitter Mentions: </strong>",agg_mention_2015$count,
"<br><strong>November 24, 2016 Departure Delays: </strong>",AGG_ORIG_NOV2015$DepDel15,
"<br><strong>November 24, 2016 Twitter Mentions: </strong>",agg_mention_2016$count
)
#Create a Map and Insert Markers from ptld_df
leaflet(data = geo_df) %>% addTiles() %>%
setView(lng = -98.58, lat = 39.82, zoom = 4) %>%
#Insert Third Party Tiles
addProviderTiles("Esri.DeLorme") %>%
#Add Markers and Pop-up Data
addMarkers(~lon, ~lat, popup = ~hub_popup)
Click on one of the markers to view airport data and twitter mentions