While searching Kaggle for an interesting dataset to analyze, I came across a collection of tweets about different airlines. Because transportation is an integral part of any trip, I found it useful and exciting to explore how Twitter users feel about popular airlines. Text mining these tweets could support a fair comparison between the airlines and help customers choose one with more positive feedback.
The Guiding Questions for this analysis are as follows:
Which airline has the highest satisfaction among its customers?
What are the most commonly used words among Twitter users about airlines?
How is the result of each lexicon different from the others for a similar dataset?
We add all packages that would be needed for this project.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(wordcloud2)
First, the dataset is imported:
tweets <- read_csv("data/tweets.csv")
## Rows: 14640 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (11): airline_sentiment, negativereason, airline, airline_sentiment_gold...
## dbl (4): tweet_id, airline_sentiment_confidence, negativereason_confidence,...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tweets)
## # A tibble: 6 x 15
## tweet_id airline_sentiment airline_sentiment~ negativereason negativereason_c~
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 5.70e17 neutral 1 <NA> NA
## 2 5.70e17 positive 0.349 <NA> 0
## 3 5.70e17 neutral 0.684 <NA> NA
## 4 5.70e17 negative 1 Bad Flight 0.703
## 5 5.70e17 negative 1 Can't Tell 1
## 6 5.70e17 negative 1 Can't Tell 0.684
## # ... with 10 more variables: airline <chr>, airline_sentiment_gold <chr>,
## # name <chr>, negativereason_gold <chr>, retweet_count <dbl>, text <chr>,
## # tweet_coord <chr>, tweet_created <chr>, tweet_location <chr>,
## # user_timezone <chr>
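One detail in the column specification is worth a check: `tweet_id` is parsed as `dbl`, and values around 5.70e17 are larger than 2^53, beyond which a double can no longer represent every integer. The sketch below uses two made-up IDs (not taken from the dataset) to show how distinct IDs can collapse to the same number, which could matter later when tweets are grouped by `tweet_id`. Reading the column as character, as in the commented line, is one possible way around this; it is an assumption about how the import could be written, not what this analysis did.

```r
# Two hypothetical 18-digit IDs that differ only in the last digit.
id_a <- "576460752303423481"
id_b <- "576460752303423489"

# As text they are different...
ids_differ_as_text <- id_a != id_b

# ...but as doubles they round to the same value, because doubles near
# 5.7e17 are spaced more than 1 apart.
ids_equal_as_double <- as.numeric(id_a) == as.numeric(id_b)

print(c(ids_differ_as_text, ids_equal_as_double))

# Assumed alternative import that keeps IDs as text (requires readr):
# tweets <- read_csv("data/tweets.csv",
#                    col_types = cols(tweet_id = col_character()))
```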
Since there are some features in our dataset that we do not need, we pick the desired ones:
selected_tweets <- select(tweets, airline, text)
selected_tweets
## # A tibble: 14,640 x 2
## airline text
## <chr> <chr>
## 1 Virgin America "@VirginAmerica What @dhepburn said."
## 2 Virgin America "@VirginAmerica plus you've added commercials to the experien~
## 3 Virgin America "@VirginAmerica I didn't today... Must mean I need to take an~
## 4 Virgin America "@VirginAmerica it's really aggressive to blast obnoxious \"e~
## 5 Virgin America "@VirginAmerica and it's a really big bad thing about it"
## 6 Virgin America "@VirginAmerica seriously would pay $30 a flight for seats th~
## 7 Virgin America "@VirginAmerica yes, nearly every time I fly VX this “ear wor~
## 8 Virgin America "@VirginAmerica Really missed a prime opportunity for Men Wit~
## 9 Virgin America "@virginamerica Well, I didn't…but NOW I DO! :-D"
## 10 Virgin America "@VirginAmerica it was amazing, and arrived an hour early. Yo~
## # ... with 14,630 more rows
In the next step, we tokenize and remove stop words from our data:
tokens_tweets <-
selected_tweets %>%
unnest_tokens(output = word,
input = text,
token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
tokens_tweets
## # A tibble: 256,747 x 2
## airline word
## <chr> <chr>
## 1 Virgin America @virginamerica
## 2 Virgin America what
## 3 Virgin America @dhepburn
## 4 Virgin America said
## 5 Virgin America @virginamerica
## 6 Virgin America plus
## 7 Virgin America youve
## 8 Virgin America added
## 9 Virgin America commercials
## 10 Virgin America to
## # ... with 256,737 more rows
tidy_tweets <-
tokens_tweets %>%
anti_join(stop_words, by = "word")
tidy_tweets
## # A tibble: 120,547 x 2
## airline word
## <chr> <chr>
## 1 Virgin America @virginamerica
## 2 Virgin America @dhepburn
## 3 Virgin America @virginamerica
## 4 Virgin America youve
## 5 Virgin America added
## 6 Virgin America commercials
## 7 Virgin America experience
## 8 Virgin America tacky
## 9 Virgin America @virginamerica
## 10 Virgin America didnt
## # ... with 120,537 more rows
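The output above hints at why a token such as `im` can survive stop-word removal: the tweet tokenizer strips apostrophes (`youve`, `didnt`), whereas the `stop_words` table stores contractions with their apostrophes. The sketch below illustrates this on one made-up sentence using tidytext's default word tokenizer, which keeps the apostrophe. The sentence and the `toy_` names are illustrative assumptions.

```r
library(dplyr)
library(tidytext)

# A made-up tweet, not taken from the dataset.
toy_tweet <- tibble(airline = "Example Air",
                    text = "I'm on the cancelled flight")

# Default word tokenizer: lowercases and keeps the apostrophe in "i'm".
toy_words <- unnest_tokens(toy_tweet, output = word, input = text)

# "i'm", "on", and "the" match entries in stop_words and are removed.
toy_clean <- anti_join(toy_words, stop_words, by = "word")
print(toy_clean$word)

# With the apostrophe stripped, the token no longer matches any stop word.
print("im" %in% stop_words$word)
```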
One of the guiding questions at the beginning of this project is about the most repeated words in tweets. To find out the answer, the following code is run:
count_tweets <- count(tidy_tweets, word, sort = T)
count_tweets
## # A tibble: 17,145 x 2
## word n
## <chr> <int>
## 1 flight 3865
## 2 @united 3827
## 3 @usairways 2972
## 4 @americanair 2913
## 5 @southwestair 2426
## 6 @jetblue 2092
## 7 cancelled 1048
## 8 service 951
## 9 time 769
## 10 im 763
## # ... with 17,135 more rows
Looking at the result, some uninformative tokens and airline handles are among the top words, so we remove them:
tidy_tweets <-
  tokens_tweets %>%
  anti_join(stop_words, by = "word") %>%
  # Drop airline handles and noise tokens left over after stop-word removal.
  filter(!word %in% c("@united", "@usairways", "@virginamerica",
                      "@southwestair", "@jetblue", "@americanair",
                      "im", "amp"),
         # `word >= 50` compares strings: R coerces 50 to "50", so this drops
         # tokens that sort before "50" (for example "2" or "30").
         word >= "50")
count_tweets <- count(tidy_tweets, word, sort = T)
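The same clean-up can be expressed as a second `anti_join` against a small table of unwanted tokens, which is easier to extend than a chain of conditions. This is a minimal sketch on made-up tokens; `toy_tokens` and `custom_stop_words` are hypothetical names, not objects from the analysis.

```r
library(dplyr)

# Toy tokens standing in for `tokens_tweets` (hypothetical data).
toy_tokens <- tibble(
  airline = c("United", "United", "Delta", "Delta", "Delta"),
  word    = c("@united", "flight", "im", "delayed", "amp")
)

# Hypothetical helper table: airline handles and noise tokens to drop.
custom_stop_words <- tibble(
  word = c("@united", "@usairways", "@virginamerica", "@southwestair",
           "@jetblue", "@americanair", "im", "amp")
)

# Keep only rows of toy_tokens whose word is NOT in custom_stop_words.
toy_tidy <- anti_join(toy_tokens, custom_stop_words, by = "word")
print(toy_tidy$word)
```

Binding `custom_stop_words` to `stop_words` with `bind_rows()` would let a single `anti_join` handle both lists.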
Now, it is time to obtain sentiments using four lexicons: Bing, AFINN, Loughran, and NRC.
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
loughran <- get_sentiments("loughran")
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 11,874 x 3
## airline word value
## <chr> <chr> <dbl>
## 1 Virgin America aggressive -2
## 2 Virgin America obnoxious -3
## 3 Virgin America bad -3
## 4 Virgin America pay -1
## 5 Virgin America bad -3
## 6 Virgin America missed -2
## 7 Virgin America opportunity 2
## 8 Virgin America amazing 4
## 9 Virgin America suicide -2
## 10 Virgin America death -2
## # ... with 11,864 more rows
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 11,199 x 3
## airline word sentiment
## <chr> <chr> <chr>
## 1 Virgin America tacky negative
## 2 Virgin America aggressive negative
## 3 Virgin America obnoxious negative
## 4 Virgin America bad negative
## 5 Virgin America bad negative
## 6 Virgin America missed negative
## 7 Virgin America parody negative
## 8 Virgin America amazing positive
## 9 Virgin America suicide negative
## 10 Virgin America leading positive
## # ... with 11,189 more rows
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")
sentiment_nrc
## # A tibble: 42,906 x 3
## airline word sentiment
## <chr> <chr> <chr>
## 1 Virgin America trip surprise
## 2 Virgin America aggressive anger
## 3 Virgin America aggressive fear
## 4 Virgin America aggressive negative
## 5 Virgin America blast anger
## 6 Virgin America blast fear
## 7 Virgin America blast negative
## 8 Virgin America blast surprise
## 9 Virgin America obnoxious anger
## 10 Virgin America obnoxious disgust
## # ... with 42,896 more rows
sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")
sentiment_loughran
## # A tibble: 7,533 x 3
## airline word sentiment
## <chr> <chr> <chr>
## 1 Virgin America recourse litigious
## 2 Virgin America bad negative
## 3 Virgin America bad negative
## 4 Virgin America missed negative
## 5 Virgin America opportunity positive
## 6 Virgin America leading positive
## 7 Virgin America excited positive
## 8 Virgin America innovation positive
## 9 Virgin America miss negative
## 10 Virgin America worry negative
## # ... with 7,523 more rows
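The row counts differ across lexicons because `inner_join` keeps only tokens that appear in the lexicon, and each lexicon covers a different vocabulary. NRC returns the most rows because a single word can carry several labels, as `aggressive` does above. A minimal sketch with the Bing lexicon and three made-up tokens (the `toy_` names are assumptions):

```r
library(dplyr)
library(tidytext)

# Made-up tokens; "bad" and "amazing" also appear in the Bing output above.
toy_tokens <- tibble(
  airline = "Example Air",
  word    = c("bad", "amazing", "flight")
)

bing <- get_sentiments("bing")

# "flight" has no entry in the lexicon, so inner_join drops it.
toy_sentiment <- inner_join(toy_tokens, bing, by = "word")
print(toy_sentiment)
```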
To get a general idea of what each lexicon returns for our data, we count how many tokens are labeled positive and how many negative, starting with Bing:
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_bing
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 7364
## 2 positive 3835
Now we can group the counts by airline. We use mutate to add a column holding the number of positive words minus the number of negative words, which makes comparing airlines simpler.
summary_bing <- sentiment_bing %>%
group_by(airline) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
mutate(sentiment = positive - negative) %>%
mutate(lexicon = "bing") %>%
relocate(lexicon)
summary_bing
## # A tibble: 6 x 5
## # Groups: airline [6]
## lexicon airline negative positive sentiment
## <chr> <chr> <int> <int> <int>
## 1 bing American 1476 659 -817
## 2 bing Delta 790 603 -187
## 3 bing Southwest 886 737 -149
## 4 bing United 2273 1006 -1267
## 5 bing US Airways 1800 649 -1151
## 6 bing Virgin America 139 181 42
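In current tidyr, `spread()` is superseded by `pivot_wider()`. The sketch below repeats the net-score calculation on made-up labels with `pivot_wider()`; `values_fill = 0` is an added safeguard so that an airline with no words in one category yields a number rather than `NA`. The data and the `toy_` names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy word-level labels standing in for `sentiment_bing` (hypothetical data).
toy_bing <- tibble(
  airline   = c("A", "A", "A", "B", "B"),
  sentiment = c("negative", "negative", "positive", "positive", "positive")
)

toy_summary <- toy_bing %>%
  count(airline, sentiment) %>%
  # One column per sentiment label; missing combinations become 0.
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net score: positive words minus negative words.
  mutate(net = positive - negative)

print(toy_summary)
```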
We do the same for AFINN:
summary_afinn <- sentiment_afinn %>%
group_by(airline) %>%
summarise(sentiment = sum(value)) %>%
mutate(lexicon = "AFINN") %>%
relocate(lexicon)
summary_afinn
## # A tibble: 6 x 3
## lexicon airline sentiment
## <chr> <chr> <dbl>
## 1 AFINN American -1290
## 2 AFINN Delta 180
## 3 AFINN Southwest 0
## 4 AFINN United -1588
## 5 AFINN US Airways -1865
## 6 AFINN Virgin America 176
And for NRC, where the score is the ratio of positive to negative words rather than their difference:
summary_nrc <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(airline) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc") %>%
spread(sentiment, n) %>%
mutate(sentiment = positive/negative)
summary_nrc
## # A tibble: 6 x 5
## # Groups: airline [6]
## airline method negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 American nrc 1343 1463 1.09
## 2 Delta nrc 823 1143 1.39
## 3 Southwest nrc 869 1379 1.59
## 4 United nrc 2125 2448 1.15
## 5 US Airways nrc 1641 1606 0.979
## 6 Virgin America nrc 166 288 1.73
Also for Loughran:
summary_loughran <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(airline) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran") %>%
spread(sentiment, n) %>%
mutate(sentiment = positive/negative) %>%
relocate(method)
summary_loughran
## # A tibble: 6 x 5
## # Groups: airline [6]
## method airline negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 loughran American 1293 133 0.103
## 2 loughran Delta 653 131 0.201
## 3 loughran Southwest 831 186 0.224
## 4 loughran United 1830 260 0.142
## 5 loughran US Airways 1404 155 0.110
## 6 loughran Virgin America 119 41 0.345
I found wordcloud2 the best tool for showing the most frequently used words in our data.
top_tokens<- tidy_tweets %>%
count(word, sort = TRUE)%>%
top_n(100)
## Selecting by n
wordcloud2(top_tokens, size = 1, shape = 'star')
Also the number of tweets about each airline is demonstrated in the following bar chart.
ggplot(selected_tweets, aes(x = airline, fill = airline)) +
geom_bar(width = .8, show.legend = FALSE) +
xlab(label = "Airline") +
ylab(label = "Number of Tweets")
Based on the above bar chart, the highest proportion of tweets belongs to “United” and the lowest belongs to “Virgin America”.
As in the Unit 2 Walkthrough, it is helpful to compute an overall sentiment score for each tweet rather than for each word.
polish_text <- tweets %>%
select(tweet_id, airline, text)
sentiment_afinn <- polish_text %>%
unnest_tokens(output = word,
input = text,
token = "tweets") %>%
anti_join(stop_words, by = "word") %>%
filter(!word %in% c("@united", "@usairways", "@virginamerica",
                    "@southwestair", "@jetblue", "@americanair",
                    "im", "amp"),
       # String comparison: R coerces 50 to "50" before comparing.
       word > "50") %>%
inner_join(afinn, by = "word")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
sentiment_afinn
## # A tibble: 11,874 x 4
## tweet_id airline word value
## <dbl> <chr> <chr> <dbl>
## 1 5.70e17 Virgin America aggressive -2
## 2 5.70e17 Virgin America obnoxious -3
## 3 5.70e17 Virgin America bad -3
## 4 5.70e17 Virgin America pay -1
## 5 5.70e17 Virgin America bad -3
## 6 5.70e17 Virgin America missed -2
## 7 5.70e17 Virgin America opportunity 2
## 8 5.70e17 Virgin America amazing 4
## 9 5.70e17 Virgin America suicide -2
## 10 5.70e17 Virgin America death -2
## # ... with 11,864 more rows
Now we can compute the score for each tweet. In this step, tweets with a score of zero are omitted and the rest are labeled positive or negative according to the sign of their score.
afinn_score <- sentiment_afinn %>%
group_by(airline, tweet_id) %>%
summarise(value = sum(value))
## `summarise()` has grouped output by 'airline'. You can override using the `.groups` argument.
afinn_sentiment <- afinn_score %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 7,573 x 4
## # Groups: airline [6]
## airline tweet_id value sentiment
## <chr> <dbl> <dbl> <chr>
## 1 American 5.69e17 1 positive
## 2 American 5.70e17 2 positive
## 3 American 5.70e17 -6 negative
## 4 American 5.70e17 -1 negative
## 5 American 5.70e17 -1 negative
## 6 American 5.70e17 -1 negative
## 7 American 5.70e17 -1 negative
## 8 American 5.70e17 4 positive
## 9 American 5.70e17 3 positive
## 10 American 5.70e17 -2 negative
## # ... with 7,563 more rows
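The per-tweet scoring can be traced on a few made-up rows. In this sketch the data and the `toy_` names are hypothetical, and `.groups = "drop"` is an added choice that returns an ungrouped table (and avoids the grouping message), whereas the code above keeps the result grouped by airline.

```r
library(dplyr)

# Toy word-level AFINN matches standing in for `sentiment_afinn`.
toy_afinn <- tibble(
  airline  = c("A", "A", "A", "A", "B"),
  tweet_id = c(1, 1, 2, 2, 3),
  value    = c(-3, 1, 2, -2, 4)
)

toy_scores <- toy_afinn %>%
  group_by(airline, tweet_id) %>%
  # Sum word values within each tweet.
  summarise(value = sum(value), .groups = "drop") %>%
  # Tweet 2 sums to zero and is dropped here.
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

print(toy_scores)
```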
This would be easier to interpret if we convert them into the ratio of negative tweets to positive:
afinn_ratio <- afinn_sentiment %>%
group_by(airline) %>%
count(sentiment) %>%
spread(sentiment, n) %>%
mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 6 x 4
## # Groups: airline [6]
## airline negative positive ratio
## <chr> <int> <int> <dbl>
## 1 American 969 456 2.12
## 2 Delta 531 507 1.05
## 3 Southwest 664 537 1.24
## 4 United 1341 771 1.74
## 5 US Airways 1131 437 2.59
## 6 Virgin America 96 133 0.722
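As a quick check of the arithmetic, the ratios can be recomputed from the counts printed above. The sketch also adds the share of negative tweets, an extra column not used elsewhere in this analysis, and sorts the airlines by ratio.

```r
library(dplyr)

# Tweet counts copied from the `afinn_ratio` table above.
afinn_counts <- tibble(
  airline  = c("American", "Delta", "Southwest", "United",
               "US Airways", "Virgin America"),
  negative = c(969, 531, 664, 1341, 1131, 96),
  positive = c(456, 507, 537, 771, 437, 133)
)

afinn_check <- afinn_counts %>%
  mutate(ratio = negative / positive,
         # Share of negative tweets among all scored tweets, in percent.
         pct_negative = 100 * negative / (negative + positive)) %>%
  arrange(desc(ratio))

print(afinn_check)
```

Sorted this way, US Airways comes first and Virgin America last, matching the table above.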
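Pie charts make these ratios easier to see. The two below show United, the airline with the most tweets, and Virgin America, the airline with the lowest ratio. Note that the highest ratio (2.59) belongs to US Airways, not United.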
afinn_counts_United <- afinn_sentiment %>%
group_by(airline) %>%
count(sentiment) %>%
filter(airline == "United")
afinn_counts_United %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "United",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
afinn_counts_virginamerica <- afinn_sentiment %>%
group_by(airline) %>%
count(sentiment) %>%
filter(airline == "Virgin America")
afinn_counts_virginamerica %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "Virgin America",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
To compare the share of positive and negative words across the four lexicons, we create a summary for each lexicon and bind them together.
summary_afinn2 <- sentiment_afinn %>%
group_by(airline) %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>%
group_by(airline) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(airline) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc")
summary_loughran2 <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(airline) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran")
summary_sentiment <- bind_rows(summary_afinn2,
summary_bing2,
summary_nrc2,
summary_loughran2) %>%
arrange(method, airline) %>%
relocate(method)
summary_sentiment
## # A tibble: 48 x 4
## # Groups: airline [6]
## method airline sentiment n
## <chr> <chr> <chr> <int>
## 1 AFINN American negative 1565
## 2 AFINN American positive 735
## 3 AFINN Delta negative 831
## 4 AFINN Delta positive 712
## 5 AFINN Southwest negative 1004
## 6 AFINN Southwest positive 788
## 7 AFINN United negative 2160
## 8 AFINN United positive 1344
## 9 AFINN US Airways negative 1732
## 10 AFINN US Airways positive 676
## # ... with 38 more rows
Next, we count the total number of sentiment words for each lexicon and airline and merge that with the summary_sentiment table obtained earlier.
total_counts <- summary_sentiment %>%
group_by(method, airline)%>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "airline")
sentiment_counts
## # A tibble: 48 x 5
## # Groups: airline [6]
## method airline sentiment n total
## <chr> <chr> <chr> <int> <int>
## 1 AFINN American negative 1565 2300
## 2 AFINN American positive 735 2300
## 3 AFINN Delta negative 831 1543
## 4 AFINN Delta positive 712 1543
## 5 AFINN Southwest negative 1004 1792
## 6 AFINN Southwest positive 788 1792
## 7 AFINN United negative 2160 3504
## 8 AFINN United positive 1344 3504
## 9 AFINN US Airways negative 1732 2408
## 10 AFINN US Airways positive 676 2408
## # ... with 38 more rows
And obtain the percentage of positive and negative words for each airline.
sentiment_percents <- sentiment_counts %>% #calculates the percentage of positive and negative words
mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 48 x 6
## # Groups: airline [6]
## method airline sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 AFINN American negative 1565 2300 68.0
## 2 AFINN American positive 735 2300 32.0
## 3 AFINN Delta negative 831 1543 53.9
## 4 AFINN Delta positive 712 1543 46.1
## 5 AFINN Southwest negative 1004 1792 56.0
## 6 AFINN Southwest positive 788 1792 44.0
## 7 AFINN United negative 2160 3504 61.6
## 8 AFINN United positive 1344 3504 38.4
## 9 AFINN US Airways negative 1732 2408 71.9
## 10 AFINN US Airways positive 676 2408 28.1
## # ... with 38 more rows
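The totals and percentages can also be computed in one grouped `mutate`, without a separate summary table and join. A minimal sketch using the first four AFINN rows printed above (`afinn_rows` and `afinn_percents` are hypothetical names):

```r
library(dplyr)

# Rows copied from the `summary_sentiment` output above.
afinn_rows <- tibble(
  method    = "AFINN",
  airline   = c("American", "American", "Delta", "Delta"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(1565, 735, 831, 712)
)

afinn_percents <- afinn_rows %>%
  group_by(method, airline) %>%
  # Within each lexicon and airline: total words, then each label's share.
  mutate(total = sum(n),
         percent = n / total * 100) %>%
  ungroup()

print(afinn_percents)
```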
Finally, we can draw the most comprehensive chart of this study: a stacked bar chart of the airlines for each lexicon.
sentiment_percents %>%
ggplot(aes(x = airline, y = percent, fill=sentiment)) +
geom_bar(width = .8, stat = "identity") +
facet_wrap(~method, ncol = 1) +
coord_flip() +
labs(title = "Public Sentiment on Twitter",
x = "Airline",
y = "Percentage of Words")
Narrate: Discussion and Future Directions
This sentiment analysis answered all three guiding questions. For the first, Virgin America achieved the highest satisfaction among Twitter users. For the second, the word cloud shows the words used most often in tweets about the airlines. For the third, the last bar chart shows how the lexicons differ. Interestingly, ranking the airlines by these percentages gives almost the same order under every lexicon, although the percentages themselves vary considerably in some cases.
As a future direction, I would like to investigate metrics that account for the difference in the number of tweets about each airline. Specifically, far fewer tweets mention Virgin America than the other airlines, even though it has the highest share of positive tweets. I would also like to train a classifier on the results of this study and predict the sentiment of newer tweets without lexicons, to see how the results differ. Another idea is to take time into account and see how opinions about airlines change.