Executive summary

This report examines how tweets respond to climate issues depending on whether their authors believe that global warming actually exists. For the group that believes the climate crisis exists, we ask whether the tweets dwell on negative problems such as environmental pollution or speak more positively, for example about how to overcome the crisis. We also examine in detail how the group that does not believe in the climate crisis receives the climate crisis discourse. To this end, TF-IDF is first used to find the characteristic words of each climate crisis recognition group, which are then visualized. Next, a sentiment analysis is run on the words weighted by TF-IDF to see which emotional words stand out in each group. Finally, bigrams (n-grams with n = 2) are used to visualize which words appear most frequently in each group and which other words they are connected to.

Data background

This data was obtained from ‘Sentiment of Climate Change’ on ‘data.world’. It was created by Kent Cavender-Bares on August 30, 2013, and consists of tweets evaluated for whether they express belief that global warming or climate change exists. There are three main labels. ‘YES’ marks a tweet that recognizes that global warming is occurring, and ‘NO’ marks a tweet that distrusts the occurrence of global warming. ‘N/A’ marks tweets that are ambiguous or not related to global warming.

Data loading, cleaning and preprocessing

The raw data contained several labels: YES, NO, Y, N, and N/A. The labels were not clean; for example, the group that believes there is a climate crisis was split between YES and Y. The data were therefore rearranged into three labels: YES, NO, and N/A (stored as ‘Yes’, ‘No’, and ‘N/A’ in the code). The csv file downloaded in advance was loaded with ‘read.csv’ and assigned to an object called ‘tweet’. The label column, originally named ‘existence’, was recoded into a column named ‘climate’, with YES for tweets that accept that there is a climate crisis, NO for tweets that deny it, and N/A for ambiguous tweets. To this end, N and NO were unified as NO, and Y and YES were unified as YES. Some tweets carried no label at all, not even N/A. The following code was written to form the ‘climate’ column with the three labels while leaving those unlabeled tweets outside the three groups. <mutate(climate = ifelse(existence == "Y", "Yes", ifelse(existence == "N", "No", ifelse(existence == "N/A", "N/A", existence))))>
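The recoding step above can also be sketched with ‘case_when’, which some readers find easier to follow than nested ‘ifelse’ calls. This is a minimal sketch, not the code used in the report: the toy data frame ‘toy’ and the exact raw label spellings ("Y", "Yes", "N", "No", "N/A") are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the raw data; the label spellings are assumed.
toy <- tibble(existence = c("Y", "Yes", "N", "No", "N/A", NA))

recoded <- toy %>%
  mutate(climate = case_when(
    existence %in% c("Y", "Yes") ~ "Yes",   # believes the crisis exists
    existence %in% c("N", "No")  ~ "No",    # denies the crisis
    existence == "N/A"           ~ "N/A",   # ambiguous or unrelated
    TRUE                         ~ NA_character_  # unlabeled tweets
  )) %>%
  filter(!is.na(climate))                   # drop tweets with no label

recoded$climate
```

Dropping the unlabeled rows right away is a design choice of this sketch; the report's own code keeps them in the data frame.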

After that, each tweet was tokenized into words using ‘unnest_tokens’, and common stop words were removed first through ‘anti_join(stop_words)’. Because the data is a collection of tweets, the text often includes ‘RT’, the marker that shows a post was retweeted, so the word ‘RT’ appeared excessively. Many tweets also carried links along with their content. To remove these additional tokens, the unnecessary words were individually listed and removed using ‘filter’.
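The preprocessing described above is not shown in the report, so the following is only a sketch of how a table like ‘tidy_tweet’ (one row per group and word, with a count ‘n’) could be built. The toy tweets, the object names ‘toy_tweet’ and ‘toy_tidy’, and the short removal list are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets standing in for the real data.
toy_tweet <- tibble(
  climate = c("Yes", "No"),
  tweet   = c("RT Global warming is melting the snow [link]",
              "Snow in Utah so global warming is a scam [link]")
)

toy_tidy <- toy_tweet %>%
  unnest_tokens(input = tweet, output = word) %>%  # one lowercase word per row
  anti_join(stop_words, by = "word") %>%           # drop common stop words
  filter(!word %in% c("rt", "link")) %>%           # drop Twitter-specific tokens
  count(climate, word)                             # word counts per group

toy_tidy
```

The resulting ‘n’ column is what ‘bind_tf_idf’ expects as its count argument.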

Text data analysis

Individual analysis and figures

Analysis and Figure 1

TF-IDF was used to find out which words appear relatively frequently in each of the YES, NO, and N/A groups. We wanted to see what difference in word choice exists between tweets that believe the climate crisis exists and tweets that do not. We also analyzed whether each group takes a positive or negative attitude toward global warming and the climate crisis, depending on whether it believes the crisis exists.

First, the tf_idf value was calculated for each word in each of the three groups using ‘bind_tf_idf’. After that, the top 10 words per group were selected by tf_idf value through ‘slice_max’, with ‘with_ties = F’ set so that tied values do not add extra rows and exactly 10 words are obtained for each of the three groups.
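To make the calculation concrete, the sketch below reproduces it in base R on made-up counts: tf is a word's count divided by the total count in its group, and idf is the natural log of the number of groups divided by the number of groups that contain the word. With three groups, a word found in one group gets idf = ln(3), about 1.10, and a word found in two groups gets ln(3/2), about 0.405, which matches the idf column in the output printed in this section. The data frame ‘counts’ and its numbers are illustrative only.

```r
# Toy counts: one row per (group, word) pair.
counts <- data.frame(
  climate = c("Yes", "Yes", "No", "No", "N/A", "N/A"),
  word    = c("wildlife", "snow", "scam", "snow", "clinique", "trials"),
  n       = c(4, 6, 3, 7, 2, 8),
  stringsAsFactors = FALSE
)

# Term frequency: share of the group's words taken by this word.
counts$tf <- counts$n / ave(counts$n, counts$climate, FUN = sum)

# Document frequency: number of groups in which each word occurs.
doc_freq <- table(counts$word)
n_docs   <- length(unique(counts$climate))

# Inverse document frequency (natural log) and the final score.
counts$idf    <- log(n_docs / as.numeric(doc_freq[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

counts
```

Because ‘snow’ occurs in two of the three groups, its score is pulled down even though its raw count is high.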

For the graph, the panels (facets) were separated by YES, NO, and N/A, and the tf_idf values were drawn as a bar graph. For readability, ‘coord_flip’ was added to place the words on the y-axis. In addition, ‘x = reorder_within(word, tf_idf, climate)’ was used to sort the words by tf_idf within each group, and ‘scale_x_reordered’ was then used to restore the plain word labels on the axis.

YES The words with the highest TF-IDF values are ‘wildlife’, ‘reduce’, and ‘habitat’. In particular, words such as ‘reduce’, ‘protect’, and ‘worse’ stood out more than in the other groups. This suggests that tweets believing that there is a climate crisis are aware of the seriousness of global warming and use words that persuade and appeal for a solution. Words such as ‘wildlife’, ‘snow’, ‘plants’, and ‘weather’ are also used a lot, which suggests that those who believe the climate crisis exists see the problem as one affecting the weather and the ecosystem as a whole.
NO The words with the highest TF-IDF values are ‘snow’, ‘goore’, and ‘utah’. Climate-related words such as ‘snow’, ‘weather’, and ‘winter’ appeared in this group as well. However, the response to the climate crisis relies on negative words that deny its very existence, such as ‘conspiracy’ and ‘scam’. This suggests that the climate crisis, including global warming, is regarded in these tweets as a lie created by a fanatic group.
N/A In the original ‘data.world’ data, ambiguous tweets from which belief in the climate crisis is hard to judge were classified as N/A. When TF-IDF was applied to the N/A group, the top words were ‘clinique’, ‘clip’, ‘clinical’, ‘trials’, and so on. Overall, the words do not share a common theme and are dispersed across various topics. This suggests that the N/A tweets mostly do not deal with climate crisis-related content.

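library(dplyr)     # pipes, group_by, slice_max
library(tidytext)  # bind_tf_idf, reorder_within, scale_x_reordered
library(ggplot2)

# TF-IDF per word, treating each climate group as one document;
# keep the 10 highest-scoring words per group
tweet_tf_idf <- tidy_tweet %>%
  bind_tf_idf(term = word,
              document = climate,
              n = n) %>%
  group_by(climate) %>%
  slice_max(tf_idf, n = 10, with_ties = F)
tweet_tf_idf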

## # A tibble: 30 × 6
## # Groups:   climate [3]
##    climate word          n      tf   idf  tf_idf
##    <chr>   <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 N/A     clinique     16 0.00370 1.10  0.00407
##  2 N/A     clip         15 0.00347 1.10  0.00382
##  3 N/A     clinical     35 0.00810 0.405 0.00329
##  4 N/A     trials       35 0.00810 0.405 0.00329
##  5 N/A     collagen     28 0.00648 0.405 0.00263
##  6 N/A     graham's     24 0.00556 0.405 0.00225
##  7 N/A     screaming    19 0.00440 0.405 0.00178
##  8 N/A     cnnbrk        7 0.00162 1.10  0.00178
##  9 N/A     limbo        15 0.00347 0.405 0.00141
## 10 N/A     exit         14 0.00324 0.405 0.00131
## # ℹ 20 more rows

ggplot(tweet_tf_idf, aes(x = reorder_within(word, tf_idf, climate),
                  y = tf_idf,
                  fill = climate)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ climate, scales = "free", ncol = 3) +
  scale_x_reordered() +
  labs(x = NULL, y = NULL) +
  ggtitle('The Climate Crisis Tweet : TF_IDF')

Analysis and Figure 2

Sentiment analysis using the ‘afinn’ lexicon was conducted on the TF-IDF values, which give more weight to words that appear more in one group than in the others.

YES In the YES chart, positive and negative words occupy a similar proportion. Positive words include ‘gift’, ‘cleaner’, and ‘solutions’; they are mainly words that urge action to overcome the climate crisis. Negative words include ‘worse’, ‘kills’, and ‘blame’. In short, positive words tied to campaigns for overcoming the climate crisis appear frequently, alongside strong words that signal how serious the crisis is.
NO There was only one positive word; all the other words carried negative sentiment. Words such as ‘fraud’ and ‘bullshit’ appeared as the most negative vocabulary, which suggests that NO tweets were written mainly in the context of lies, fraud, and conspiracies about the climate crisis, including global warming.
N/A Positive words were ‘amaze’ and ‘advantages’, and negative words were ‘scam’, ‘dispute’, and ‘stalling’. Positive words appeared more often than in NO, but a large number of negative words were used in N/A as well. The original data states that tweets whose relation to global warming is ambiguous were classified as N/A, and the TF-IDF results alone did not make the content of N/A clear. In the sentiment analysis, however, words such as ‘polluter’ appear, so some N/A tweets can be expected to relate to environmental issues.

To check which emotional words stand out in each ‘climate’ group, ‘afinn’ was chosen among the sentiment lexicons. ‘afinn’ assigns each word a score from -5 to +5 according to its sentiment, whereas bing and nrc classify words into categories such as positive, negative, or joy. With bing and nrc, few words were shared between the data and the lexicon, so afinn, which could score relatively more of the words, was selected. The sentiment analysis was run on ‘tweet_tf_idf’, the data set containing the TF-IDF values. We used ‘inner_join’ to keep only the words that also appear in afinn, grouped by climate, and selected the top 10 words per group by tf_idf value using ‘slice_max’. For the visualization, the words were then ordered by their afinn value so that the bars appear in order. As with the TF-IDF bar graph, the words are hard to read on the x-axis, so the x-axis and y-axis were swapped.
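The join-then-select logic described above can be sketched on toy data as follows. Because ‘get_sentiments("afinn")’ needs a one-time download, this sketch uses a small hand-made lexicon instead; ‘toy_lexicon’ has the same two columns (word, value), but its scores are illustrative and are not the real AFINN entries. The tf_idf numbers and object names are likewise made up.

```r
library(dplyr)

# Made-up TF-IDF table in the same shape as tweet_tf_idf.
toy_tf_idf <- tibble(
  climate = c("Yes", "Yes", "Yes", "No", "No"),
  word    = c("worse", "wildlife", "gift", "fraud", "snow"),
  tf_idf  = c(0.004, 0.006, 0.003, 0.005, 0.007)
)

# Hypothetical stand-in for the AFINN lexicon (illustrative scores only).
toy_lexicon <- tibble(
  word  = c("worse", "gift", "fraud"),
  value = c(-3, 2, -4)
)

toy_sentiment <- toy_tf_idf %>%
  inner_join(toy_lexicon, by = "word") %>%       # keep only scored words
  group_by(climate) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>% # top words per group
  ungroup()

toy_sentiment
```

Note that ‘wildlife’ and ‘snow’ drop out even though their tf_idf is highest: words without a lexicon score cannot enter the sentiment chart, which is why the choice of lexicon matters.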

tweet_tf_idf <- tidy_tweet %>%
  bind_tf_idf(term = word,           
              document = climate, 
              n = n)

climate_sentiment <- tweet_tf_idf %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(climate) %>% 
  slice_max(tf_idf, n = 10, with_ties = F) %>% 
  mutate(word = factor(word, levels = word[order(value)]))

## Joining with `by = join_by(word)`

ggplot(climate_sentiment, aes(x = word,
                  y = value,
                  fill = climate)) +
  geom_col(show.legend = F) +
  facet_wrap(~ climate, scales = "free_y", ncol = 3) +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = NULL) +
  ggtitle('The Climate Crisis Tweet : TF_IDF Sentiment')

Analysis and Figure 3

Bigrams (n-grams with n = 2) were used to find out which words are mentioned most frequently in each climate group and which other words each of them is connected to. First, the tweets were tokenized into pairs of consecutive words, and each pair was separated into two columns, word1 and word2. Unnecessary words were then individually listed and removed from both columns using ‘filter’. For the network visualization, the words were divided into clusters of related words with ‘group_infomap’ and each cluster was given a different color; set.seed(1234) was set so that the drawing is reproducible. Only word pairs that occur more than 15 times were kept. This threshold was chosen with the NO and N/A visualizations in mind as well. The lines connecting words were drawn in gray. In particular, through code such as <geom_node_point(aes(size = centrality, color = group), show.legend = F)>, the size of each node was set to vary with its degree centrality, and its color to vary with the cluster it belongs to.
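As a rough illustration of what the node size encodes, the base R sketch below applies the same frequency threshold to a made-up bigram table and counts how many distinct partners each word has. For an undirected network without repeated pairs, this count is the degree that ‘centrality_degree’ reports. The data frame ‘edges’ and its counts are assumptions for illustration.

```r
# Made-up bigram counts: one row per word pair.
edges <- data.frame(
  word1 = c("global", "climate", "climate", "global", "snow"),
  word2 = c("warming", "change", "crisis", "climate", "utah"),
  n     = c(40, 35, 20, 18, 9),
  stringsAsFactors = FALSE
)

# Same rule as in the report: keep pairs seen more than 15 times.
kept <- edges[edges$n > 15, ]

# Degree of a word = number of kept pairs it takes part in.
degree <- table(c(kept$word1, kept$word2))

sort(degree, decreasing = TRUE)
```

Here ‘climate’ has the largest degree because it pairs with three different words, so it would be drawn as the largest node.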

YES ‘climate’ is one of the words connected to the most other words. It acts as a key word linked to other key words such as ‘global’, ‘fighting’, and ‘change’. In the network, life, welfare, and protection are connected, which shows that ecosystem crisis problems frequently appear as an extension of the climate crisis. In addition, global, speed, and reduction are interconnected, which shows that tweets recognizing the climate crisis frequently argue that a global response must be made.
NO Many climate-related words appeared in this network as well. However, unlike YES, global warming is connected to a word meaning fabricated. Apart from global warming, some words were linked only in isolated pairs. Tweets that do not recognize the climate crisis portray it negatively, as an incited falsehood, and their word pairs were not very frequent.
N/A Unlike in the TF-IDF results, words related to the environment do appear here, and the network makes it easy to see which words are connected to each other. Environmental content therefore also appears in N/A. Climate and change are interconnected. Environmental issues were also discussed through the connection of global and warm, but it was difficult to judge from these tweets whether they accept that the climate crisis exists.

library(tidyr)      # separate
library(tidygraph)  # as_tbl_graph, centrality_degree, group_infomap
library(ggraph)

# Split each tweet into bigrams (pairs of consecutive words)
tweet_bigram <- tweet %>%
  unnest_tokens(input = tweet,
                output = bigram,
                token = "ngrams",
                n = 2)

# Retweet markers and link fragments to drop
noise_words <- c("link", "rt", "http", "bit.ly", "retwt.me", "ow.ly", "tinyurl.com", "oohja.com", "_o", "url4")

# One column per word; drop stop words and noise tokens
sperate_tweet <- tweet_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  filter(!word1 %in% noise_words,
         !word2 %in% noise_words)

#YES
yes_pair_tweet <- sperate_tweet %>%
  filter(climate == "Yes") %>% 
  count(word1, word2, sort = T) %>%
  na.omit()


yes_graph <- yes_pair_tweet %>%
  filter(n > 15) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),
         group = as.factor(group_infomap()))

set.seed(1234)
ggraph(yes_graph, layout = "fr") +
  geom_edge_link(color = "gray50",
                 alpha = 0.5) +
  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = F) +
  scale_size(range = c(5, 10)) +
  geom_node_text(aes(label = name),
                 repel = T,
                 size = 5) +
  theme_graph() +
  ggtitle('The Climate Crisis Tweet : Yes n-gram')

#NO
no_pair_tweet <- sperate_tweet %>%
  filter(climate == "No") %>% 
  count(word1, word2, sort = T) %>%
  na.omit()

no_graph <- no_pair_tweet %>%
  filter(n > 15) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),
         group = as.factor(group_infomap()))

set.seed(1234)  # fix the random "fr" layout so the figure is reproducible
ggraph(no_graph, layout = "fr") +
  geom_edge_link(color = "gray50",
                 alpha = 0.5) +
  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = F) +
  scale_size(range = c(5, 10)) +
  geom_node_text(aes(label = name),
                 repel = T,
                 size = 5) +
  theme_graph() +
  ggtitle('The Climate Crisis Tweet : NO n-gram')

#N/A
na_pair_tweet <- sperate_tweet %>%
  filter(climate == "N/A") %>% 
  count(word1, word2, sort = T) %>%
  na.omit()

na_graph <- na_pair_tweet %>%
  filter(n > 15) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),
         group = as.factor(group_infomap()))

set.seed(1234)  # fix the random "fr" layout so the figure is reproducible
ggraph(na_graph, layout = "fr") +
  geom_edge_link(color = "gray50",
                 alpha = 0.5) +
  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = F) +
  scale_size(range = c(5, 10)) +
  geom_node_text(aes(label = name),
                 repel = T,
                 size = 5) +
  theme_graph() +
  ggtitle('The Climate Crisis Tweet : N/A n-gram')

The Climate Crisis Tweet Analysis

Park So-Ri

2024.06.11