1. Prepare

1a. Some Context

While searching Kaggle for an interesting dataset to analyze, I encountered a fascinating one containing tweets about different airlines. Since transportation is an integral part of any trip, I thought it would be useful and exciting to get some insight into how Twitter users feel about popular airlines. Text mining these tweets could lead to a fair comparison between the airlines, which would help customers select an airline with more positive feedback.

1b. Guiding Questions

The guiding questions for this analysis are as follows:

  1. Which airline has the highest customer satisfaction?

  2. Which words do Twitter users most commonly use when tweeting about airlines?

  3. How do the results of the different lexicons compare on the same dataset?

1c. Set Up

We load all packages needed for this project.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(wordcloud2)
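
If any of these packages are missing, they can be installed from CRAN first (a minimal sketch):

pkgs <- c("dplyr", "readr", "tidyr", "rtweet", "writexl", "readxl",
          "tidytext", "textdata", "ggplot2", "scales", "wordcloud2")
missing <- setdiff(pkgs, rownames(installed.packages()))
# install only the packages that are not already available
if (length(missing) > 0) install.packages(missing)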

2. Wrangle

First, the dataset is imported:

tweets <- read_csv("data/tweets.csv")
## Rows: 14640 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (11): airline_sentiment, negativereason, airline, airline_sentiment_gold...
## dbl  (4): tweet_id, airline_sentiment_confidence, negativereason_confidence,...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tweets)
## # A tibble: 6 x 15
##   tweet_id airline_sentiment airline_sentiment~ negativereason negativereason_c~
##      <dbl> <chr>                          <dbl> <chr>                      <dbl>
## 1  5.70e17 neutral                        1     <NA>                      NA    
## 2  5.70e17 positive                       0.349 <NA>                       0    
## 3  5.70e17 neutral                        0.684 <NA>                      NA    
## 4  5.70e17 negative                       1     Bad Flight                 0.703
## 5  5.70e17 negative                       1     Can't Tell                 1    
## 6  5.70e17 negative                       1     Can't Tell                 0.684
## # ... with 10 more variables: airline <chr>, airline_sentiment_gold <chr>,
## #   name <chr>, negativereason_gold <chr>, retweet_count <dbl>, text <chr>,
## #   tweet_coord <chr>, tweet_created <chr>, tweet_location <chr>,
## #   user_timezone <chr>
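
As the import message itself suggests, the column-specification note can be silenced by setting show_col_types = FALSE (or by supplying explicit column types):

tweets <- read_csv("data/tweets.csv", show_col_types = FALSE)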

Since the dataset contains several features we do not need, we select only the ones of interest:

selected_tweets <- select(tweets, airline,  text)
selected_tweets
## # A tibble: 14,640 x 2
##    airline        text                                                          
##    <chr>          <chr>                                                         
##  1 Virgin America "@VirginAmerica What @dhepburn said."                         
##  2 Virgin America "@VirginAmerica plus you've added commercials to the experien~
##  3 Virgin America "@VirginAmerica I didn't today... Must mean I need to take an~
##  4 Virgin America "@VirginAmerica it's really aggressive to blast obnoxious \"e~
##  5 Virgin America "@VirginAmerica and it's a really big bad thing about it"     
##  6 Virgin America "@VirginAmerica seriously would pay $30 a flight for seats th~
##  7 Virgin America "@VirginAmerica yes, nearly every time I fly VX this “ear wor~
##  8 Virgin America "@VirginAmerica Really missed a prime opportunity for Men Wit~
##  9 Virgin America "@virginamerica Well, I didn't…but NOW I DO! :-D"             
## 10 Virgin America "@VirginAmerica it was amazing, and arrived an hour early. Yo~
## # ... with 14,630 more rows

In the next step, we tokenize and remove stop words from our data:

tokens_tweets <- 
  selected_tweets %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
tokens_tweets
## # A tibble: 256,747 x 2
##    airline        word          
##    <chr>          <chr>         
##  1 Virgin America @virginamerica
##  2 Virgin America what          
##  3 Virgin America @dhepburn     
##  4 Virgin America said          
##  5 Virgin America @virginamerica
##  6 Virgin America plus          
##  7 Virgin America youve         
##  8 Virgin America added         
##  9 Virgin America commercials   
## 10 Virgin America to            
## # ... with 256,737 more rows
tidy_tweets <-
  tokens_tweets %>%
  anti_join(stop_words, by = "word")
tidy_tweets
## # A tibble: 120,547 x 2
##    airline        word          
##    <chr>          <chr>         
##  1 Virgin America @virginamerica
##  2 Virgin America @dhepburn     
##  3 Virgin America @virginamerica
##  4 Virgin America youve         
##  5 Virgin America added         
##  6 Virgin America commercials   
##  7 Virgin America experience    
##  8 Virgin America tacky         
##  9 Virgin America @virginamerica
## 10 Virgin America didnt         
## # ... with 120,537 more rows
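
As a quick sanity check, the difference in row counts shows how many tokens the stop-word filter removed:

nrow(tokens_tweets) - nrow(tidy_tweets)   # 256,747 - 120,547 = 136,200 tokens removed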

One of the guiding questions at the beginning of this project concerns the most frequently used words in the tweets. To answer it, we run the following code:

count_tweets <- count(tidy_tweets, word, sort = T)
count_tweets
## # A tibble: 17,145 x 2
##    word              n
##    <chr>         <int>
##  1 flight         3865
##  2 @united        3827
##  3 @usairways     2972
##  4 @americanair   2913
##  5 @southwestair  2426
##  6 @jetblue       2092
##  7 cancelled      1048
##  8 service         951
##  9 time            769
## 10 im              763
## # ... with 17,135 more rows

Looking at the result, we find that some meaningless tokens and airline handles are among the top words, so we remove them:

tidy_tweets <-
  tokens_tweets %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("@united", "@usairways", "@virginamerica",
                      "@southwestair", "@jetblue", "@americanair",
                      "im", "amp"))

count_tweets <- count(tidy_tweets, word, sort = T)
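
As an aside, a more general alternative (a sketch, assuming the stringr package) drops every Twitter handle with a single regex; note that unlike the filter above, it also removes non-airline mentions such as @dhepburn:

library(stringr)

tidy_tweets_nohandles <- tokens_tweets %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^@"),       # drop all @handles
         !word %in% c("im", "amp"))     # drop leftover filler tokens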

Now, it is time to obtain sentiments using four lexicons: Bing, AFINN, Loughran, and NRC. (Bing ships with tidytext; the other three are downloaded through the textdata package the first time they are requested.)

afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
loughran <- get_sentiments("loughran")
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 11,874 x 3
##    airline        word        value
##    <chr>          <chr>       <dbl>
##  1 Virgin America aggressive     -2
##  2 Virgin America obnoxious      -3
##  3 Virgin America bad            -3
##  4 Virgin America pay            -1
##  5 Virgin America bad            -3
##  6 Virgin America missed         -2
##  7 Virgin America opportunity     2
##  8 Virgin America amazing         4
##  9 Virgin America suicide        -2
## 10 Virgin America death          -2
## # ... with 11,864 more rows
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 11,199 x 3
##    airline        word       sentiment
##    <chr>          <chr>      <chr>    
##  1 Virgin America tacky      negative 
##  2 Virgin America aggressive negative 
##  3 Virgin America obnoxious  negative 
##  4 Virgin America bad        negative 
##  5 Virgin America bad        negative 
##  6 Virgin America missed     negative 
##  7 Virgin America parody     negative 
##  8 Virgin America amazing    positive 
##  9 Virgin America suicide    negative 
## 10 Virgin America leading    positive 
## # ... with 11,189 more rows
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")
sentiment_nrc
## # A tibble: 42,906 x 3
##    airline        word       sentiment
##    <chr>          <chr>      <chr>    
##  1 Virgin America trip       surprise 
##  2 Virgin America aggressive anger    
##  3 Virgin America aggressive fear     
##  4 Virgin America aggressive negative 
##  5 Virgin America blast      anger    
##  6 Virgin America blast      fear     
##  7 Virgin America blast      negative 
##  8 Virgin America blast      surprise 
##  9 Virgin America obnoxious  anger    
## 10 Virgin America obnoxious  disgust  
## # ... with 42,896 more rows
sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")
sentiment_loughran
## # A tibble: 7,533 x 3
##    airline        word        sentiment
##    <chr>          <chr>       <chr>    
##  1 Virgin America recourse    litigious
##  2 Virgin America bad         negative 
##  3 Virgin America bad         negative 
##  4 Virgin America missed      negative 
##  5 Virgin America opportunity positive 
##  6 Virgin America leading     positive 
##  7 Virgin America excited     positive 
##  8 Virgin America innovation  positive 
##  9 Virgin America miss        negative 
## 10 Virgin America worry       negative 
## # ... with 7,523 more rows

3. Explore

To get a general idea of each lexicon's output on our data, we start with Bing and count how many tokens carry a positive versus a negative sense:

summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_bing
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   7364
## 2 positive   3835

Now we can group the results by airline. We use spread() to place the positive and negative counts side by side and mutate() to add a column with their difference (positive minus negative), which makes comparing airlines simpler.

summary_bing <- sentiment_bing %>% 
  group_by(airline) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(lexicon = "bing") %>%
  relocate(lexicon)
summary_bing
## # A tibble: 6 x 5
## # Groups:   airline [6]
##   lexicon airline        negative positive sentiment
##   <chr>   <chr>             <int>    <int>     <int>
## 1 bing    American           1476      659      -817
## 2 bing    Delta               790      603      -187
## 3 bing    Southwest           886      737      -149
## 4 bing    United             2273     1006     -1267
## 5 bing    US Airways         1800      649     -1151
## 6 bing    Virgin America      139      181        42

We do the same for AFINN; since AFINN assigns numeric scores rather than labels, we sum the values:

summary_afinn <- sentiment_afinn %>% 
  group_by(airline) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)
summary_afinn
## # A tibble: 6 x 3
##   lexicon airline        sentiment
##   <chr>   <chr>              <dbl>
## 1 AFINN   American           -1290
## 2 AFINN   Delta                180
## 3 AFINN   Southwest              0
## 4 AFINN   United             -1588
## 5 AFINN   US Airways         -1865
## 6 AFINN   Virgin America       176

And for NRC; here the sentiment column is the ratio of positive to negative words rather than their difference:

summary_nrc <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(airline) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc")  %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive/negative)
summary_nrc
## # A tibble: 6 x 5
## # Groups:   airline [6]
##   airline        method negative positive sentiment
##   <chr>          <chr>     <int>    <int>     <dbl>
## 1 American       nrc        1343     1463     1.09 
## 2 Delta          nrc         823     1143     1.39 
## 3 Southwest      nrc         869     1379     1.59 
## 4 United         nrc        2125     2448     1.15 
## 5 US Airways     nrc        1641     1606     0.979
## 6 Virgin America nrc         166      288     1.73

Also for Loughran, a finance-oriented lexicon whose positive word list is comparatively small, which is why its ratios skew low:

summary_loughran <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(airline) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran")  %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive/negative) %>% 
  relocate(method)
summary_loughran
## # A tibble: 6 x 5
## # Groups:   airline [6]
##   method   airline        negative positive sentiment
##   <chr>    <chr>             <int>    <int>     <dbl>
## 1 loughran American           1293      133     0.103
## 2 loughran Delta               653      131     0.201
## 3 loughran Southwest           831      186     0.224
## 4 loughran United             1830      260     0.142
## 5 loughran US Airways         1404      155     0.110
## 6 loughran Virgin America      119       41     0.345

4. Model

I found wordcloud2 the best tool to display the most frequently used words in our data.

top_tokens <- tidy_tweets %>%
  count(word, sort = TRUE) %>%
  top_n(100)
## Selecting by n
wordcloud2(top_tokens, size = 1, shape = 'star')  
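
Note that top_n() is superseded in recent dplyr releases; slice_max() is the modern equivalent and avoids the "Selecting by n" message:

top_tokens <- tidy_tweets %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 100)   # keep the 100 most frequent words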

The number of tweets about each airline is also shown in the following bar chart.

ggplot(selected_tweets, aes(x = airline, fill = airline)) +
  geom_bar(width = .8, show.legend = FALSE) +
  xlab(label = "Airline") +
  ylab(label = "Number of Tweets")

Based on the above bar chart, the highest proportion of tweets belongs to “United” and the lowest belongs to “Virgin America”.
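
The exact counts behind the chart can be verified with a quick tally:

count(selected_tweets, airline, sort = TRUE)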

5. Communicate

Similar to the Unit 2 Walkthrough, I think it would be helpful to measure an overall sentiment score for each tweet instead of for each word.

polish_text <- tweets %>%
  select(tweet_id, airline, text) 

sentiment_afinn <- polish_text %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "tweets")  %>% 
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("@united", "@usairways", "@virginamerica",
                      "@southwestair", "@jetblue", "@americanair",
                      "im", "amp")) %>%
  inner_join(afinn, by = "word")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
sentiment_afinn
## # A tibble: 11,874 x 4
##    tweet_id airline        word        value
##       <dbl> <chr>          <chr>       <dbl>
##  1  5.70e17 Virgin America aggressive     -2
##  2  5.70e17 Virgin America obnoxious      -3
##  3  5.70e17 Virgin America bad            -3
##  4  5.70e17 Virgin America pay            -1
##  5  5.70e17 Virgin America bad            -3
##  6  5.70e17 Virgin America missed         -2
##  7  5.70e17 Virgin America opportunity     2
##  8  5.70e17 Virgin America amazing         4
##  9  5.70e17 Virgin America suicide        -2
## 10  5.70e17 Virgin America death          -2
## # ... with 11,864 more rows

Now we can compute the score for each tweet. In this step, tweets whose total score equals zero are omitted, and the rest are categorized as positive or negative based on their value.

afinn_score <- sentiment_afinn %>% 
  group_by(airline, tweet_id) %>% 
  summarise(value = sum(value))
## `summarise()` has grouped output by 'airline'. You can override using the `.groups` argument.
afinn_sentiment <- afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

afinn_sentiment
## # A tibble: 7,573 x 4
## # Groups:   airline [6]
##    airline  tweet_id value sentiment
##    <chr>       <dbl> <dbl> <chr>    
##  1 American  5.69e17     1 positive 
##  2 American  5.70e17     2 positive 
##  3 American  5.70e17    -6 negative 
##  4 American  5.70e17    -1 negative 
##  5 American  5.70e17    -1 negative 
##  6 American  5.70e17    -1 negative 
##  7 American  5.70e17    -1 negative 
##  8 American  5.70e17     4 positive 
##  9 American  5.70e17     3 positive 
## 10 American  5.70e17    -2 negative 
## # ... with 7,563 more rows
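
As the message above points out, the grouping behaviour can be set explicitly (and the message silenced) via the .groups argument of summarise():

afinn_score <- sentiment_afinn %>% 
  group_by(airline, tweet_id) %>% 
  summarise(value = sum(value), .groups = "drop_last")   # keep grouping by airline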

This is easier to interpret if we convert the counts into the ratio of negative to positive tweets:

afinn_ratio <- afinn_sentiment %>% 
  group_by(airline) %>% 
  count(sentiment) %>% 
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)

afinn_ratio
## # A tibble: 6 x 4
## # Groups:   airline [6]
##   airline        negative positive ratio
##   <chr>             <int>    <int> <dbl>
## 1 American            969      456 2.12 
## 2 Delta               531      507 1.05 
## 3 Southwest           664      537 1.24 
## 4 United             1341      771 1.74 
## 5 US Airways         1131      437 2.59 
## 6 Virgin America       96      133 0.722

Pie charts of the airlines with the highest and lowest ratios also illustrate how stark the difference is.

afinn_counts_United <- afinn_sentiment %>%
  group_by(airline) %>% 
  count(sentiment) %>%
  filter(airline == "United")

afinn_counts_United %>%
  ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "United",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()

afinn_counts_virginamerica <- afinn_sentiment %>%
  group_by(airline) %>% 
  count(sentiment) %>%
  filter(airline == "Virgin America")

afinn_counts_virginamerica %>%
  ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Virgin America",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()
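
As an aside, both charts could be produced in a single call by converting the counts to within-airline shares and faceting (a sketch):

afinn_sentiment %>%
  filter(airline %in% c("United", "Virgin America")) %>%
  count(airline, sentiment) %>%
  group_by(airline) %>%
  mutate(share = n / sum(n)) %>%   # shares sum to 1 within each airline
  ggplot(aes(x = "", y = share, fill = sentiment)) +
  geom_col(width = .6) +
  coord_polar(theta = "y") +
  facet_wrap(~airline) +
  labs(title = "Proportion of Positive & Negative Tweets") +
  theme_void()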

To compare the proportions of positive and negative tweets across the four lexicons, we create a summary for each lexicon and bind them together.

summary_afinn2 <- sentiment_afinn %>% 
  group_by(airline) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  group_by(airline) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(airline) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 

summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(airline) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran") 

summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, airline) %>%
  relocate(method)
summary_sentiment
## # A tibble: 48 x 4
## # Groups:   airline [6]
##    method airline    sentiment     n
##    <chr>  <chr>      <chr>     <int>
##  1 AFINN  American   negative   1565
##  2 AFINN  American   positive    735
##  3 AFINN  Delta      negative    831
##  4 AFINN  Delta      positive    712
##  5 AFINN  Southwest  negative   1004
##  6 AFINN  Southwest  positive    788
##  7 AFINN  United     negative   2160
##  8 AFINN  United     positive   1344
##  9 AFINN  US Airways negative   1732
## 10 AFINN  US Airways positive    676
## # ... with 38 more rows

Next, we count the total number of sentiment-carrying words for each airline under each lexicon and merge that with the summary_sentiment table obtained earlier.

total_counts <- summary_sentiment %>%
  group_by(method, airline)%>%
  summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "airline")
sentiment_counts
## # A tibble: 48 x 5
## # Groups:   airline [6]
##    method airline    sentiment     n total
##    <chr>  <chr>      <chr>     <int> <int>
##  1 AFINN  American   negative   1565  2300
##  2 AFINN  American   positive    735  2300
##  3 AFINN  Delta      negative    831  1543
##  4 AFINN  Delta      positive    712  1543
##  5 AFINN  Southwest  negative   1004  1792
##  6 AFINN  Southwest  positive    788  1792
##  7 AFINN  United     negative   2160  3504
##  8 AFINN  United     positive   1344  3504
##  9 AFINN  US Airways negative   1732  2408
## 10 AFINN  US Airways positive    676  2408
## # ... with 38 more rows
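
The joining message can be silenced by naming the keys explicitly:

sentiment_counts <- left_join(summary_sentiment, total_counts,
                              by = c("method", "airline"))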

And obtain the percentage of positive and negative words for each airline.

sentiment_percents <- sentiment_counts %>% #calculates the percentage of positive and negative words
  mutate(percent = n/total * 100)

sentiment_percents
## # A tibble: 48 x 6
## # Groups:   airline [6]
##    method airline    sentiment     n total percent
##    <chr>  <chr>      <chr>     <int> <int>   <dbl>
##  1 AFINN  American   negative   1565  2300    68.0
##  2 AFINN  American   positive    735  2300    32.0
##  3 AFINN  Delta      negative    831  1543    53.9
##  4 AFINN  Delta      positive    712  1543    46.1
##  5 AFINN  Southwest  negative   1004  1792    56.0
##  6 AFINN  Southwest  positive    788  1792    44.0
##  7 AFINN  United     negative   2160  3504    61.6
##  8 AFINN  United     positive   1344  3504    38.4
##  9 AFINN  US Airways negative   1732  2408    71.9
## 10 AFINN  US Airways positive    676  2408    28.1
## # ... with 38 more rows

Finally, we can draw the most comprehensive chart of this study: stacked bar charts of the airlines for each lexicon.

sentiment_percents %>%
  ggplot(aes(x = airline, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter",
       x = "Airline", 
       y = "Percentage of Words")

6. Narrate: Discussion and Future Directions

Through conducting this sentiment analysis, I found answers to all of the guiding questions. For RQ1, I found that Virgin America achieved the highest satisfaction among Twitter users. For RQ2, I provided a word cloud showing the top words Twitter users write about these airlines. For RQ3, the last bar chart shows the differences between the lexicons. Interestingly, if we rank the airlines by these percentages, the ordering is almost identical across lexicons, even though the rates themselves vary considerably in some cases.
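
This ranking is easy to verify directly; for example, picking the airline with the highest positive share under each lexicon (a quick sketch):

sentiment_percents %>%
  filter(sentiment == "positive") %>%
  group_by(method) %>%
  slice_max(percent, n = 1)   # top airline per lexicon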

As a future direction, I would like to investigate metrics that account for the difference in the number of tweets about each airline. Specifically, we saw that the number of tweets about Virgin America was far lower than for the other airlines, even though, proportionally, Virgin America had the highest share of positive tweets. I would also like to train a classifier on the results of this study and predict the sentiment of newer tweets without using lexicons, to see how different the results would be. Another idea is to take time into account and see how it affects people's opinions about the airlines.