Collecting & Analyzing Big Data for Social Sciences

Assignment 2: Research Notebook

Group Members: Sercan Kıyak (0650472), Hatem Alharazin (0702702), Mathias Vasilev (0884889)

Overview

  1. Research Question
  2. Data & Research Design
  3. Analysis
  4. Results & Discussion
  5. Conclusion
  6. Elevator Pitch

1. Research Question

The aim of this research notebook is to analyse online political communication in Germany and Austria related to the invasion of Ukraine.

This event has many political, social and economic implications for the future of Europe, such as the Ukrainian refugee crisis, military policies and rising gas prices. Many politicians have had to take sides on the invasion and have communicated their positions on social media sites such as Twitter. The importance of Twitter for the dissemination of ideas in political discussions in contemporary societies cannot be overstated. Moreover, thanks to its advanced API, Twitter offers digital sociology and social media studies many opportunities to monitor political trends and social changes.

In the following section, the research design and data are described. Then, various computational analyses are conducted and the results are visualized. These are followed by a discussion and a conclusion section. Finally, we offer a summary elevator pitch of our paper.

2. Data & Research Design

In order to answer our research question, we focus on data collected from a list of political parties and politicians from both countries. We acquired the data from iCandid with the permission of Prof. Leen d’Haenens. The original dataset consists of tweets and their metadata posted between 2015 and 2022 and was collected based on a list of Twitter handles of politicians and parties from four European countries (Germany, Austria, Italy and Hungary).

We chose to focus on Germany and Austria in our research for three reasons: 1) both have important ties and connections to Russia, yet both are also EU members; 2) their national language is German, which makes certain NLP applications and analyses easier and more consistent; and 3) our team’s language capabilities and familiarity with their national contexts.

Time period: The dataset covers 1st January 2015 to 3rd April 2022. The invasion started on 24th February 2022. However, we wanted to monitor the emergence of the crisis and therefore included all the tweets from 2022.

Dataset Features:

  * Id: Unique id of the tweet
  * Type: Item type generated by iCandid. All data has the type “Message”
  * Author: The original author of the tweet
  * Text: The text of the tweet
  * Sender: The sending account of the tweet. The values in this feature correspond to our list of accounts
  * datePublished: Date of the tweet
  * Url: Url address of the tweet
  * Keywords: Hashtags contained in the text
  * Mentions: Mentions contained in the text
  * Country: The country of origin of the account

We focused on sender, text, date and hashtags in order to apply our computational analyses.

The goal of this research project is to compare two countries with certain historical, social and linguistic ties and to find similarities and differences in their online political discussions of the invasion of Ukraine. To do so, we first clean the data. We then proceed with descriptive statistics, using visualisations and tables based on party and date. Next, we move on to the linguistic analyses, starting with wordclouds and continuing with more advanced NLP applications, namely sentiment analysis and topic detection of the tweets. In all these methods we pay special attention to 1) similar trends and patterns emerging between the two countries and 2) differences in linguistic expressions and social media grammars. Moreover, we also try to detect prominent political actors in relation to the Ukraine crisis in each country.

3. Analysis

Step 1: Load the required packages

library(knitr)
library(glue)
library(tidyverse)
library(readr)
library(dplyr)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)
library(tm)
library(SnowballC)
library(RCurl)
library(XML)
library(tidytext)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(udpipe)
library(spacyr)
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(topicmodels)
library(cowplot)

Step 2: Import the data

The dataset consists of 2 .csv files, one for Germany and another for Austria.

opts_knit$set(progress=FALSE, verbose=FALSE)
twitter_Germany <- read_delim("twitterDuitsePoliticiAccount.csv", 
    delim = "\t", escape_double = FALSE, 
    trim_ws = TRUE)
## Rows: 284198 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (8): id, type, author, text, sender, url, keywords, mentions
## date (1): datePublished
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View(twitter_Germany)

opts_knit$set(progress=FALSE, verbose=FALSE)
twitter_Austria <- read_delim("twitterOostenrijksePoliticiAccount.csv", 
    delim = "\t", escape_double = FALSE, 
    trim_ws = TRUE)
## Rows: 188936 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (8): id, type, author, text, sender, url, keywords, mentions
## date (1): datePublished
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View(twitter_Austria)
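The column-specification messages above appear because the column types are guessed at import. As a minimal sketch, and not the import actually used in this notebook, the types could be declared up front. The helper name `read_tweets()` and the toy file are illustrative assumptions; only the column name `datePublished` comes from the dataset.

```r
# Hypothetical helper: read a tab-delimited tweet export with explicit types.
read_tweets <- function(path) {
  # Read every column as text so that ids keep leading zeros and nothing is guessed
  df <- read.delim(path, sep = "\t", quote = "", colClasses = "character",
                   encoding = "UTF-8", stringsAsFactors = FALSE)
  # Parse the publication date explicitly
  df$datePublished <- as.Date(df$datePublished)
  df
}

# Toy file standing in for the real export (made-up rows)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id\ttext\tsender\tdatePublished",
             "001\tHallo Welt\tAccount A\t2022-02-24",
             "002\tGuten Tag\tAccount B\t2021-12-31"), tmp)
toy <- read_tweets(tmp)
str(toy)
```

Reading ids as character is a deliberate choice here: tweet ids are long digit strings that can lose precision when stored as numbers.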

A look into the dataframes: Germany

head(twitter_Germany)

Austria

head(twitter_Austria)

Step 3: Filtering the time period

Select the time period of the study and the required variables.

# Select tweets from 2022-01-01 onwards (the data end on 2022-04-03);
# >= keeps the tweets published on 1 January itself
tweets_Germany <- filter(twitter_Germany, 
                  datePublished >= as.Date("2022-01-01"))

tweets_Austria <- filter(twitter_Austria, 
                  datePublished >= as.Date("2022-01-01"))

# select the targeted variables
tweets_Germany <- select(tweets_Germany, author, text, sender, datePublished, keywords, mentions)
tweets_Austria <- select(tweets_Austria, author, text, sender, datePublished, keywords, mentions)
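The time window can also be made explicit at both ends. The following base-R sketch illustrates an inclusive date filter on a toy data frame; the helper `filter_period()` is a hypothetical name and not part of the notebook's pipeline.

```r
# Hypothetical helper: keep rows whose date lies in [from, to], both inclusive.
filter_period <- function(df, from, to, date_col = "datePublished") {
  d <- as.Date(df[[date_col]])
  df[!is.na(d) & d >= as.Date(from) & d <= as.Date(to), , drop = FALSE]
}

# Toy data (made-up senders) around the boundaries of the study period
toy <- data.frame(
  sender = c("A", "B", "C", "D"),
  datePublished = as.Date(c("2021-12-31", "2022-01-01", "2022-04-03", "2022-04-04")),
  stringsAsFactors = FALSE
)
kept <- filter_period(toy, "2022-01-01", "2022-04-03")
kept$sender
```

Only the rows dated 1 January and 3 April survive, which shows that both boundary days are treated as part of the period.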

Step 4: Twitter hashtag frequency

In this part, the frequency of Ukraine-related hashtags in both countries is calculated and visualized.

Germany

hashtags_Germany <- data.frame(str_split_fixed(tweets_Germany$keywords, ",", 10), datePublished = tweets_Germany$datePublished, sender = tweets_Germany$sender, mentions = tweets_Germany$mentions)
hashtags_Germany <- tibble(hashtags_Germany)

# Flag every hashtag slot (X1..X10) that mentions Ukraine, ignoring case
slots_g <- paste0("X", 1:10)
hashtags_Germany[slots_g] <- lapply(hashtags_Germany[slots_g], function(x) str_detect(toupper(str_squish(x)), "UKRAIN"))

# Number of Ukraine-related hashtags per tweet
hashtags_Germany$Count_g <- rowSums(hashtags_Germany[slots_g])

# Keep all tweets together with their text, so that the Ukraine-related tweets can be extracted later.
Ukriane_tweets_g <- cbind(hashtags_Germany, text = tweets_Germany$text)

Austria

hashtags_Austria <- data.frame(str_split_fixed(tweets_Austria$keywords, ",", 10), datePublished = tweets_Austria$datePublished, sender = tweets_Austria$sender, mentions = tweets_Austria$mentions)
hashtags_Austria <- tibble(hashtags_Austria)

# Flag every hashtag slot (X1..X10) that mentions Ukraine, ignoring case
slots_a <- paste0("X", 1:10)
hashtags_Austria[slots_a] <- lapply(hashtags_Austria[slots_a], function(x) str_detect(toupper(str_squish(x)), "UKRAIN"))

# Number of Ukraine-related hashtags per tweet
hashtags_Austria$Count_a <- rowSums(hashtags_Austria[slots_a])

# Keep all tweets together with their text, so that the Ukraine-related tweets can be extracted later.
Ukriane_tweets_a <- cbind(hashtags_Austria, text = tweets_Austria$text)
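The counting logic above splits each keyword string into at most ten slots. As an alternative sketch in base R, the hashtags can be counted without a fixed number of slots. The helper `count_matching_hashtags()` is a hypothetical name, and unlike the code above it treats a missing keyword field as zero matches, which is an assumption about the desired behaviour.

```r
# Hypothetical helper: number of hashtags per tweet that contain a pattern.
# `keywords` is a character vector of comma-separated hashtags (may contain NA).
count_matching_hashtags <- function(keywords, pattern = "UKRAIN") {
  tags <- strsplit(toupper(keywords), ",", fixed = TRUE)
  vapply(tags, function(x) {
    x <- trimws(x)
    # Missing keyword fields contribute zero matches
    sum(grepl(pattern, x[!is.na(x)], fixed = TRUE))
  }, integer(1))
}

# Toy keyword strings (made up for illustration)
toy_keywords <- c("Ukraine, Putin", "StandWithUkraine,ukraine,EU", NA, "Corona")
count_matching_hashtags(toy_keywords)
```

For the toy input this yields the counts 1, 2, 0 and 0, so a tweet with two Ukraine-related hashtags is counted twice, as in the notebook's `Count_g` and `Count_a` columns.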

The sender tables in the tweet-count step below show that in Germany party accounts tweet more than individual politicians, while in Austria individual politicians tweet more than party accounts.

NOTE: Would it not be better to 1) limit these tables to the top 10 (they are too long) and 2) show them as bar charts?

NOTE 2: Can we not also produce the same bar charts/tables for tweets with Ukraine hashtags (or tweets that contain "ukraine")? We could then normalize them and see, for example in percentages, which party or political leader devoted more of their tweets to Ukraine. That would make for a great discussion.
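The normalisation suggested in NOTE 2 could look roughly like the base-R sketch below. The helper `ukraine_share_by_sender()` and the toy accounts are hypothetical; in the notebook the logical flag would presumably be derived from `Count_g > 0` or `Count_a > 0`.

```r
# Hypothetical helper: per-sender share of tweets flagged as Ukraine-related.
ukraine_share_by_sender <- function(sender, is_ukraine) {
  total   <- tapply(is_ukraine, sender, length)  # all tweets per sender
  ukraine <- tapply(is_ukraine, sender, sum)     # flagged tweets per sender
  out <- data.frame(sender = names(total),
                    n_tweets = as.integer(total),
                    n_ukraine = as.integer(ukraine),
                    stringsAsFactors = FALSE)
  out$share_pct <- round(100 * out$n_ukraine / out$n_tweets, 1)
  out[order(-out$share_pct), ]
}

# Toy data (made-up accounts)
toy <- data.frame(
  sender = c("Party A", "Party A", "Party A", "Party A", "Politician B", "Politician B"),
  is_ukraine = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
shares <- ukraine_share_by_sender(toy$sender, toy$is_ukraine)
shares
```

In the toy data the account with fewer tweets ranks first because all of its tweets are flagged, which is exactly the effect that raw counts hide.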

Step 5: Tweet count by account

It is informative to see who tweeted the most during the relevant period. To see that, we construct a simple frequency list of tweets by sending account.

Germany

# All tweets
## Germany
Sender_freq_g <- str_squish(unlist(na.omit(toupper(str_squish(tweets_Germany$sender)))))
Senders <- data.frame(sort(table(Sender_freq_g), decreasing=TRUE))
df <- data.frame(Sender = Senders$Sender_freq_g, Freq=Senders$Freq)
f1 <- df[1:10,]

## Austria
Sender_freq_a <- str_squish(unlist(na.omit(toupper(str_squish(tweets_Austria$sender)))))
Senders <- data.frame(sort(table(Sender_freq_a), decreasing=TRUE))

df <- data.frame(Sender = Senders$Sender_freq_a, Freq=Senders$Freq)
f2 <- df[1:10,]


# Ukraine tweets
## Germany
Sender_freq_g <- Ukriane_tweets_g[Ukriane_tweets_g$Count_g > 0,]
Sender_freq_g <- str_squish(unlist(na.omit(toupper(str_squish(Sender_freq_g$sender)))))
Senders <- data.frame(sort(table(Sender_freq_g), decreasing=TRUE))
df <- data.frame(Sender = Senders$Sender_freq_g, Freq=Senders$Freq)
f3 <- df[1:10,]

## Austria
Sender_freq_a <- Ukriane_tweets_a[Ukriane_tweets_a$Count_a > 0,]
Sender_freq_a <- str_squish(unlist(na.omit(toupper(str_squish(Sender_freq_a$sender)))))
Senders <- data.frame(sort(table(Sender_freq_a), decreasing=TRUE))
df <- data.frame(Sender = Senders$Sender_freq_a, Freq=Senders$Freq)
f4 <- df[1:10, ]

f_all <- data.frame(f1,f2,f3,f4)
names(f_all) <- c("Senders_1", "Freq_1", "Sender_2", "Freq_2", "Senders_3", "Freq_3", "Senders_4", "Freq_4")
kable(f_all, col.names = c("All tweets senders (Germany)","Number of tweets","All tweets senders (Austria)","Number of tweets","Ukraine tweets senders (Germany)","Number of tweets","Ukraine tweets senders (Austria)","Number of tweets"))
| All tweets senders (Germany) | Number of tweets | All tweets senders (Austria) | Number of tweets | Ukraine tweets senders (Germany) | Number of tweets | Ukraine tweets senders (Austria) | Number of tweets |
|:---|---:|:---|---:|:---|---:|:---|---:|
| SPD PARTEIVORSTAND 🇪🇺 | 1697 | RUDI ANSCHOBER | 2277 | SPD PARTEIVORSTAND 🇪🇺 | 129 | RUDI ANSCHOBER | 98 |
| CDU/CSU | 1156 | PETER PILZ | 1353 | CDU/CSU | 105 | PETER PILZ | 17 |
| FDP | 1008 | DAS NEUE ÖSTERREICH | 560 | CEM ÖZDEMIR | 64 | DAS NEUE ÖSTERREICH | 12 |
| JOANA COTAR | 665 | BEATE MEINL-REISINGER | 522 | FDP | 59 | DIE GRÜNEN | 12 |
| CSU | 623 | SPÖ | 496 | CDU DEUTSCHLANDS | 48 | HAGEN REINHOLD, MDB | 11 |
| CEM ÖZDEMIR | 587 | FPÖ | 282 | CSU | 42 | WERNER KOGLER | 10 |
| CDU DEUTSCHLANDS | 540 | MATTHIAS STROLZ | 276 | MARKUS SÖDER | 34 | SPÖ | 9 |
| DIE LINKE | 520 | DIE GRÜNEN | 236 | DIE LINKE | 32 | MATTHIAS STROLZ | 7 |
| ALTERNATIVE FÜR 🇩🇪 DEUTSCHLAND | 449 | WERNER KOGLER | 199 | CHRISTIAN LINDNER | 29 | BEATE MEINL-REISINGER | 6 |
| MARKUS SÖDER | 407 | HAGEN REINHOLD, MDB | 143 | SAHRA WAGENKNECHT | 26 | PAMELA RENDI-WAGNER | 5 |

Step 6: Ukraine-hashtag Timeline

In this part, we plot the number of Ukraine-related hashtags over the study period. NOTE: We should do this at least on a weekly, if not daily, basis. Then we can read the plot better and say more about it. A monthly division is too broad.

Agg_hashtags_g <- aggregate(Count_g ~ datePublished, data = hashtags_Germany, sum)
Agg_hashtags_a <- aggregate(Count_a ~ datePublished, data = hashtags_Austria, sum)
# Use the full column names (Count_g / Count_a) instead of relying on partial matching of $Count
plot(Agg_hashtags_g$datePublished, Agg_hashtags_g$Count_g, type = "l", xlab = "Date", ylab = "Number of Ukraine hashtags")
lines(Agg_hashtags_a$datePublished, Agg_hashtags_a$Count_a, col = "red", type = "l")
legend("topleft", legend=c("Germany", "Austria"),
       col=c("black", "red"), lty=1, cex=0.8)

There were some initial political reactions from Germany before the invasion, while Austrian actors remained mostly silent on the issue. Unsurprisingly, the number of tweets rose sharply in both countries once the invasion began. There are more Ukraine-hashtagged tweets from German actors; however, we follow more accounts there and have more data, so this difference may simply reflect the larger volume of German data. Finally, German actors continued to tweet about Ukraine, while the debate in Austria subsided.
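Because the two countries differ in the amount of data, a weekly share of Ukraine-hashtagged tweets may be easier to compare than raw counts. The base-R sketch below shows one way to compute it; `weekly_ukraine_share()` is a hypothetical helper and the toy dates and counts are made up.

```r
# Hypothetical helper: weekly tweet totals and share of Ukraine-hashtag tweets.
# `ukraine_count` is the per-tweet number of Ukraine hashtags (e.g. Count_g).
weekly_ukraine_share <- function(dates, ukraine_count) {
  # cut() labels each week by the Monday on which it starts
  week <- as.character(cut(as.Date(dates), breaks = "week"))
  flagged <- ukraine_count > 0
  n_tweets  <- tapply(flagged, week, length)
  n_ukraine <- tapply(flagged, week, sum)
  data.frame(week = as.Date(names(n_tweets)),
             n_tweets = as.integer(n_tweets),
             n_ukraine = as.integer(n_ukraine),
             share = as.numeric(n_ukraine / n_tweets))
}

# Toy data around the start of the invasion (made-up counts)
toy_dates  <- as.Date(c("2022-02-21", "2022-02-23", "2022-02-24",
                        "2022-02-28", "2022-03-02"))
toy_counts <- c(0, 0, 2, 1, 1)
wk <- weekly_ukraine_share(toy_dates, toy_counts)
wk
```

Plotting `share` instead of `n_ukraine` for each country would put both lines on the same 0 to 1 scale, regardless of how many accounts are followed in each country.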

Step 7: Wordclouds

There are two relevant ways of generating wordclouds from our data: 1) based on the hashtags and 2) based on the tweet texts. We will do both, starting with the hashtags.

Hashtags Wordclouds

In this part, we will generate wordclouds to see the main hashtags in our target time period.

# Germany
hashtags_freq_g <- str_squish(unlist(str_split(na.omit(toupper(str_squish(tweets_Germany$keywords))), ",")))
docs <- Corpus(VectorSource(hashtags_freq_g))
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 1,  max.words=200, random.order=FALSE, rot.per=0.35,            colors=brewer.pal(8, "Dark2"))

kable(df[1:30, ], caption = "Germany")
Germany
word freq
ukraine ukraine 624
afd afd 303
bundestag bundestag 205
impfpflicht impfpflicht 194
corona corona 188
ampel ampel 166
russland russland 134
putin putin 126
bundesversammlung bundesversammlung 118
bundesregierung bundesregierung 107
teamcdu teamcdu 104
3k22 3k22 92
cdupt22 cdupt22 91
steinmeier steinmeier 85
aufinsneue aufinsneue 82
cdu cdu 81
deutschland deutschland 80
bundespräsident bundespräsident 78
saarland saarland 78
europa europa 65
dbdk22 dbdk22 64
freiheit freiheit 63
bundeswehr bundeswehr 60
omikron omikron 59
spd spd 57
habeck habeck 55
inflation inflation 54
bayern bayern 51
scholz scholz 51
standwithukraine standwithukraine 48
df1 <- df[1:15,]

The wordcloud of hashtags from Germany and its table representation show that AfD (Alternative für Deutschland) is prominently hashtagged. There are also tweets related to national celebrations. Ukraine is the most frequent hashtag, while Russia and Putin are also addressed.
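The hashtag table can also be produced without building a corpus first. The following base-R sketch is an illustrative alternative rather than the method used above; `hashtag_frequencies()` is a hypothetical helper and the keyword strings are toy data.

```r
# Hypothetical helper: frequency table of hashtags from comma-separated strings.
hashtag_frequencies <- function(keywords, top_n = 30) {
  tags <- unlist(strsplit(keywords[!is.na(keywords)], ",", fixed = TRUE))
  tags <- tolower(trimws(tags))   # normalise case and stray spaces
  tags <- tags[nzchar(tags)]      # drop empty pieces
  counts <- sort(table(tags), decreasing = TRUE)
  head(data.frame(word = names(counts), freq = as.integer(counts),
                  stringsAsFactors = FALSE), top_n)
}

# Toy keyword strings (made up for illustration)
toy_keywords <- c("Ukraine,Putin", "ukraine, Bundestag", NA, "Ukraine", "Putin ,EU")
top2 <- hashtag_frequencies(toy_keywords, top_n = 2)
top2
```

One difference worth checking against the tables in this step: as far as we understand the defaults of `TermDocumentMatrix()` in `tm`, terms shorter than three characters are dropped there, so a hashtag such as "eu" would appear in this sketch but not in the corpus-based table.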

# Austria
hashtags_freq_a <- str_squish(unlist(str_split(na.omit(toupper(str_squish(tweets_Austria$keywords))), ",")))
docs <- Corpus(VectorSource(hashtags_freq_a))
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 1,  max.words=200, random.order=FALSE, rot.per=0.35,            colors=brewer.pal(8, "Dark2"))

kable(df[1:30, ], caption = "Austria")
Austria
word freq
ukraine ukraine 147
bmichats bmichats 114
övp övp 91
sobotka sobotka 79
oevpua oevpua 58
zib2 zib2 44
oenr oenr 44
rotesfoyer rotesfoyer 40
oevpkorruptionsua oevpkorruptionsua 37
putin putin 36
longcovid longcovid 31
oevp oevp 28
russland russland 27
breaking breaking 23
covid19 covid19 22
omikron omikron 22
standwithukraine standwithukraine 20
kloibmüller kloibmüller 20
imzentrum imzentrum 19
wksta wksta 19
einland einland 17
hessenthaler hessenthaler 17
yeswecare yeswecare 16
covid covid 16
covid19at covid19at 15
neutralität neutralität 15
sideletter sideletter 15
nehammer nehammer 13
weremember weremember 13
wolf wolf 12
df2 <- df[1:15,]
par(mfrow = c(1,2))
barplot(df1$freq, names.arg = df1$word, las=2, col = 2, ylab = "Word frequency", xlab = "Hashtags", main = "Germany")

barplot(df2$freq, names.arg = df2$word, las=2, col = 3, ylab = "Word frequency", xlab = "Hashtags", main = "Austria")

NOTE: The x-axis label collides with the long hashtags on my screen; can we disable the label on the x axis, please? Also, there are more bars than hashtag labels on the x axis, so the plot is not really helpful right now.
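One possible way to address the label problem raised in the note is to draw horizontal bars with a wider left margin, so that every hashtag gets its own readable label. This is only a sketch: `plot_top_hashtags()` is a hypothetical helper, and the three values are taken from the Germany table above.

```r
# Hypothetical helper: horizontal bar chart so that long hashtags stay readable.
plot_top_hashtags <- function(word, freq, title) {
  old <- par(mar = c(4, 10, 3, 1))  # wide left margin for long labels
  on.exit(par(old))                 # restore the previous margins afterwards
  # rev() puts the most frequent hashtag at the top of the chart
  barplot(rev(freq), names.arg = rev(word), horiz = TRUE, las = 1,
          cex.names = 0.8, xlab = "Hashtag frequency", main = title)
}

mids <- plot_top_hashtags(
  word  = c("ukraine", "bundesversammlung", "standwithukraine"),
  freq  = c(624, 118, 48),
  title = "Germany (three hashtags from the table)"
)
```

The function returns the bar midpoints from `barplot()`, which can be used to add value labels with `text()` if needed.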

Ukraine is also the leading hashtag in Austria for the given time period. However, unlike in Germany, it is followed mainly by hashtags on national debates. Putin is only the 10th most frequent hashtag.

The code below generates four graphs in one frame in order to compare the weekly number of all tweets with the number of Ukraine-related ones.

# # All Tweets
#
# # Germany
tweets_per_week = tweets_Germany %>%
  mutate(week=round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n=n())
g1 <- ggplot(tweets_per_week, aes(x=week, y=n)) +
  geom_col(fill="yellow2") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("All Tweets - Germany")
# # Austria
tweets_per_week = tweets_Austria %>%
  mutate(week=round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n=n())
g2 <- ggplot(tweets_per_week, aes(x=week, y=n)) +
  geom_col(fill="red3") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("All Tweets - Austria")
#
# # Ukraine Tweets
dfdu <- Ukriane_tweets_g[Ukriane_tweets_g$Count_g > 0,]
dfau <- Ukriane_tweets_a[Ukriane_tweets_a$Count_a > 0,]

# # Germany
tweets_per_week = dfdu %>%
  mutate(week=round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n=n())
g3 <- ggplot(tweets_per_week, aes(x=week, y=n)) +
  geom_col(fill="khaki") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("Ukraine Tweets - Germany")
# # Austria
tweets_per_week = dfau %>%
  mutate(week=round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n=n())
g4 <- ggplot(tweets_per_week, aes(x=week, y=n)) +
  geom_col(fill="indianred1") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("Ukraine Tweets - Austria")

plot_grid(g1, g2, g3, g4, labels ="AUTO")

# # Tweets over time, all and Ukraine related ones plot

This four-panel chart shows that political communication in Austria intensified with the invasion: parties and politicians sent more tweets than during the relatively quiet earlier weeks of 2022. Apart from this, the sharp rise in the number of tweets at the start of the invasion and the subsequent decline follow a similar pattern in both countries.

Step 8: Sentiment Analysis

For the sentiment analysis of the tweet contents we used the “Syuzhet” package, described by its authors as “An R package for the extraction of sentiment and sentiment-based plot arcs from text.”
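Lexicon-based sentiment methods of this kind essentially look up each word in a dictionary of scored terms and add up the scores per text. The base-R sketch below illustrates the idea only; the six-word lexicon and its values are made up for illustration and are NOT the lexicon used by the package, and `score_texts()` is a hypothetical helper.

```r
# Toy lexicon: made-up entries for illustration only, not a real sentiment lexicon.
toy_lexicon <- data.frame(
  word  = c("frieden", "hilfe", "freiheit", "krieg", "angriff", "leid"),
  value = c(1, 1, 1, -1, -1, -1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum the lexicon values of the words in each text.
score_texts <- function(texts, lexicon) {
  tokens <- strsplit(tolower(texts), "[^[:alpha:]]+")
  vapply(tokens, function(w) {
    # Words that are not in the lexicon contribute nothing
    sum(lexicon$value[match(w, lexicon$word)], na.rm = TRUE)
  }, numeric(1))
}

toy_texts <- c("Frieden und Freiheit in Europa",
               "Putins Angriff bringt Krieg und Leid",
               "Heute live im Bundestag")
scores <- score_texts(toy_texts, toy_lexicon)
scores
```

The three toy texts score 2, -3 and 0. Texts without any lexicon word score zero, which is why the coverage of the lexicon for German political vocabulary matters for interpreting the results.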

NOTE: I think we run the sentiment analysis here on all tweets. Why? As it stands, this is unrelated to the research question. We can keep it, but then we should also run a sentiment analysis on the tweets about Ukraine and compare how the sentiment differs.

# Sentiment analysis for all tweets
# Germany 
Text_all_g <- Ukriane_tweets_g$text

Text_all_g <- gsub("#\\S*", "", Text_all_g)
Text_all_g <- gsub("https\\S*", "", Text_all_g) 
Text_all_g <- gsub("@\\S*", "", Text_all_g)
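Text_all_g <- gsub("&amp;", " ", Text_all_g, fixed = TRUE) # remove the HTML entity only; the bare pattern "amp" would also alter words such as "Kampf"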
Text_all_g <- gsub("[\r\n]", "", Text_all_g)
Text_all_g <- gsub("[[:punct:]]", "", Text_all_g)
Text_all_g <- gsub("\\d", "", Text_all_g)
Text_all_g <- na.omit(toupper(str_squish(Text_all_g)))

ger_all = corpus(Text_all_g) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(ger_all, max_words=200)

words <- sort(colSums(ger_all), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 300, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Germany")

# Sentiment Analysis
tg <- iconv(Text_all_g)
s1 <- get_nrc_sentiment(tg, language = "german")
barplot(colSums(s1),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Germany - Sentiment Scores Tweets')

# The language argument only takes effect with method = "nrc"; the "syuzhet" lexicon is English
values_g <- get_sentiment(Text_all_g, method = "nrc", language = "german")
simple_plot(values_g)

#Austria
Text_all_a <- Ukriane_tweets_a$text
Text_all_a <- gsub("#\\S*", "", Text_all_a)
Text_all_a <- gsub("https\\S*", "", Text_all_a) 
Text_all_a <- gsub("@\\S*", "", Text_all_a)
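Text_all_a <- gsub("&amp;", " ", Text_all_a, fixed = TRUE) # remove the HTML entity only; the bare pattern "amp" would also alter words such as "Kampf"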
Text_all_a <- gsub("[\r\n]", "", Text_all_a)
Text_all_a <- gsub("[[:punct:]]", "", Text_all_a)
Text_all_a <- gsub("\\d", "", Text_all_a)
Text_all_a <- na.omit(toupper(str_squish(Text_all_a)))


aus_all = corpus(Text_all_a) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(aus_all, max_words=200)

words <- sort(colSums(aus_all), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 200, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Austria")

# Sentiment Analysis
ta <- iconv(Text_all_a)
s2 <- get_nrc_sentiment(ta, language = "german")
barplot(colSums(s2),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Austria - Sentiment Scores Tweets')

# The language argument only takes effect with method = "nrc"; the "syuzhet" lexicon is English
values_a <- get_sentiment(Text_all_a, method = "nrc", language = "german")
simple_plot(values_a)

# Sentiment analysis for the tweets that include #Ukraine
# Germany 
#Create a vector containing only the text
Text_g <- Ukriane_tweets_g[Ukriane_tweets_g$Count_g > 0,] #selecting the tweets that include #Ukraine
Text_g <- Text_g$text

# clean the text
Text_g <- gsub("#\\S*", "", Text_g)
Text_g <- gsub("https\\S*", "", Text_g) 
Text_g <- gsub("@\\S*", "", Text_g)
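Text_g <- gsub("&amp;", " ", Text_g, fixed = TRUE) # remove the HTML entity only; the bare pattern "amp" would also alter words such as "Kampf"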
Text_g <- gsub("[\r\n]", "", Text_g)
Text_g <- gsub("[[:punct:]]", "", Text_g)
Text_g <- gsub("\\d", "", Text_g)
Text_g <- na.omit(toupper(str_squish(Text_g)))

ger = corpus(Text_g) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(ger, max_words=200)

words <- sort(colSums(ger), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 30, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Germany")

tg <- iconv(Text_g)
s1 <- get_nrc_sentiment(tg, language = "german")
barplot(colSums(s1),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Germany - Sentiment Scores Tweets')

# The language argument only takes effect with method = "nrc"; the "syuzhet" lexicon is English
values_g <- get_sentiment(Text_g, method = "nrc", language = "german")
simple_plot(values_g)

#Austria
Text_a <- Ukriane_tweets_a[Ukriane_tweets_a$Count_a > 0,]
Text_a <- Text_a$text

# clean the text
Text_a <- gsub("#\\S*", "", Text_a)
Text_a <- gsub("https\\S*", "", Text_a) 
Text_a <- gsub("@\\S*", "", Text_a)
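Text_a <- gsub("&amp;", " ", Text_a, fixed = TRUE) # remove the HTML entity only; the bare pattern "amp" would also alter words such as "Kampf"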
Text_a <- gsub("[\r\n]", "", Text_a)
Text_a <- gsub("[[:punct:]]", "", Text_a)
Text_a <- gsub("\\d", "", Text_a)
Text_a <- na.omit(toupper(str_squish(Text_a)))

aus = corpus(Text_a) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(aus, max_words=200)

words <- sort(colSums(aus), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 10, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Austria")

ta <- iconv(Text_a)
s2 <- get_nrc_sentiment(ta, language = "german") # the number of postive and negative terms
barplot(colSums(s2),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Austria - Sentiment Scores Tweets')

# The language argument only takes effect with method = "nrc"; the "syuzhet" lexicon is English
values_a <- get_sentiment(Text_a, method = "nrc", language = "german")
simple_plot(values_a)
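The same cleaning steps are repeated four times in this step. They could be bundled into one helper, sketched here in base R. `clean_tweet_text()` is a hypothetical name, and the sketch makes one deliberate choice that should be checked against the intended behaviour: it removes the literal entity `&amp;` rather than every occurrence of the letters "amp", which would also shorten words such as "Kampf".

```r
# Hypothetical helper bundling the tweet-cleaning steps of this section.
clean_tweet_text <- function(x) {
  x <- gsub("#\\S*", "", x)                 # hashtags
  x <- gsub("https?\\S*", "", x)            # links
  x <- gsub("@\\S*", "", x)                 # mentions
  x <- gsub("&amp;", " ", x, fixed = TRUE)  # HTML-escaped ampersand
  x <- gsub("[\r\n]", " ", x)               # line breaks
  x <- gsub("[[:punct:]]", "", x)           # punctuation
  x <- gsub("[[:digit:]]", "", x)           # digits
  x <- toupper(gsub("\\s+", " ", trimws(x)))
  x[!is.na(x) & nzchar(x)]                  # drop missing and empty texts
}

# Toy tweet (made up) plus a missing value
cleaned <- clean_tweet_text(c("Kampf um #Kiew &amp; Charkiw: https://t.co/abc @user 2022!", NA))
cleaned
```

The toy tweet is reduced to "KAMPF UM CHARKIW" and the missing value is dropped, so the word "Kampf" survives the cleaning intact.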

Step 9: Topic Detection

In order to detect topics in the tweet contents, we used Latent Dirichlet Allocation (LDA). The algorithm produced the following topics and the keywords associated with them.

NOTE: Can we not remove more stop words here as well? There are so many adverbs and other uninformative words in the results that they are hard to interpret.
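One way to act on this note is to drop very short tokens, an extended stop-word list, and terms that occur in a large share of the documents before fitting the model. The base-R sketch below shows the idea on toy tokens; `filter_terms()` and its thresholds are hypothetical, and in the quanteda pipeline the analogous step would presumably be a longer word list in `dfm_remove()` or a call to `dfm_trim()`.

```r
# Hypothetical helper: drop stop words, very short tokens, and terms that
# appear in more than `max_doc_share` of the documents.
filter_terms <- function(token_list, extra_stopwords = character(),
                         min_chars = 3, max_doc_share = 0.5) {
  n_docs <- length(token_list)
  # Document frequency: in how many documents does each term occur?
  doc_freq <- table(unlist(lapply(token_list, unique)))
  too_common <- names(doc_freq)[doc_freq / n_docs > max_doc_share]
  lapply(token_list, function(tok) {
    tok[nchar(tok) >= min_chars & !(tok %in% c(extra_stopwords, too_common))]
  })
}

# Toy documents built from words that appear in the topic output
toy_tokens <- list(
  c("heute", "krieg", "ukraine", "u"),
  c("heute", "mehr", "euro"),
  c("heute", "krieg", "putin"),
  c("mehr", "bayern", "heute")
)
filtered <- filter_terms(toy_tokens, extra_stopwords = c("mehr"))
filtered
```

In the toy data "heute" is removed because it occurs in every document, "mehr" because it is on the extra stop-word list, and "u" because it is too short.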

# LDA for all tweets
# Germany
lda_all_g = ger_all %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_all_g, 10)
##       Topic 1         Topic 2        Topic 3    Topic 4             Topic 5    
##  [1,] "müssen"        "heute"        "krieg"    "presseinfo"        "bm"       
##  [2,] "deutschland"   "uhr"          "ukraine"  "mehr"              "sagt"     
##  [3,] "dafür"         "unserer"      "europa"   "neue"              "wäre"     
##  [4,] "wer"           "live"         "unsere"   "gibt"              "ende"     
##  [5,] "unsere"        "unsere"       "russland" "brauchen"          "heute"    
##  [6,] "gut"           "ab"           "putin"    "weniger"           "u"        
##  [7,] "mehr"          "freiheit"     "putins"   "ministerpräsident" "darüber"  
##  [8,] "verantwortung" "opfer"        "seite"    "statt"             "fragen"   
##  [9,] "erste"         "morgen"       "stehen"   "müssen"            "bundestag"
## [10,] "recht"         "gesellschaft" "lage"     "braucht"           "deutsche" 
##       Topic 6           Topic 7       Topic 8       Topic 9          
##  [1,] "interview"       "gute"        "beim"        "mehr"           
##  [2,] "sei"             "glückwunsch" "brauchen"    "euro"           
##  [3,] "endlich"         "herzlichen"  "bayern"      "brauchen"       
##  [4,] "bürger"          "danke"       "müssen"      "müssen"         
##  [5,] "müssen"          "lieber"      "setzen"      "mrd"            
##  [6,] "bundesregierung" "frankwalter" "heute"       "schnell"        
##  [7,] "heute"           "erfolg"      "energien"    "bürger"         
##  [8,] "geht"            "freue"       "deutschland" "bundesregierung"
##  [9,] "völlig"          "arbeit"      "cl"          "macht"          
## [10,] "schritt"         "dank"        "zukunft"     "milliarden"     
##       Topic 10      
##  [1,] "frauen"      
##  [2,] "mehr"        
##  [3,] "saarland"    
##  [4,] "minister"    
##  [5,] "kinder"      
##  [6,] "tage"        
##  [7,] "landtagswahl"
##  [8,] "heute"       
##  [9,] "themen"      
## [10,] "erfahren"
# Austria
lda_all_a = aus_all %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_all_a, 10)
##       Topic 1      Topic 2     Topic 3       Topic 4     Topic 5      
##  [1,] "krieg"      "russia"    "österreich"  "ukraine"   "österreich" 
##  [2,] "ukraine"    "ukraine"   "geht"        "russian"   "russische"  
##  [3,] "putins"     "russian"   "övp"         "ukrainian" "angriff"    
##  [4,] "heute"      "kyiv"      "wurde"       "city"      "russland"   
##  [5,] "kurz"       "people"    "wksta"       "putin"     "unsere"     
##  [6,] "europa"     "new"       "steht"       "breaking"  "ukraine"    
##  [7,] "müssen"     "now"       "nehammer"    "says"      "danke"      
##  [8,] "russischen" "us"        "regierung"   "said"      "putin"      
##  [9,] "wien"       "today"     "macht"       "minister"  "solidarität"
## [10,] "russland"   "ukrainian" "neutralität" "russias"   "sanktionen" 
##       Topic 6      Topic 7       Topic 8  Topic 9    Topic 10   
##  [1,] "mehr"       "impfpflicht" "mehr"   "gute"     "heute"    
##  [2,] "regierung"  "immer"       "immer"  "heute"    "geht"     
##  [3,] "seit"       "mehr"        "seit"   "viele"    "sobotka"  
##  [4,] "österreich" "regierung"   "wer"    "tag"      "gemeinsam"
##  [5,] "geht"       "österreich"  "schon"  "pandemie" "gast"     
##  [6,] "endlich"    "gibt"        "ja"     "bitte"    "los"      
##  [7,] "jahren"     "heute"       "övp"    "omikron"  "gibt"     
##  [8,] "europa"     "övp"         "geht"   "mehr"     "övp"      
##  [9,] "heute"      "schon"       "müssen" "wurde"    "zackzack" 
## [10,] "wenig"      "fpö"         "jahren" "schon"    "erfolg"
# LDA for the tweets that include #Ukraine
# Germany
lda_g = ger %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_g, 10)
##       Topic 1       Topic 2      Topic 3         Topic 4        Topic 5        
##  [1,] "krieg"       "angriff"    "unseren"       "freiheit"     "unsere"       
##  [2,] "heute"       "presseinfo" "partnern"      "unsere"       "krieg"        
##  [3,] "ukraine"     "krieg"      "putins"        "demokratie"   "putins"       
##  [4,] "mehr"        "land"       "angriffskrieg" "deutschland"  "heute"        
##  [5,] "kiew"        "russland"   "gemeinsam"     "ukrainischen" "uhr"          
##  [6,] "unsere"      "putin"      "stehen"        "frieden"      "lage"         
##  [7,] "gast"        "treffen"    "seite"         "präsident"    "thema"        
##  [8,] "verurteilen" "russischen" "leid"          "helfen"       "gespräch"     
##  [9,] "schärfste"   "sofort"     "angriff"       "russische"    "danke"        
## [10,] "seit"        "seite"      "eu"            "seite"        "angriffskrieg"
##       Topic 6         Topic 7       Topic 8       Topic 9       Topic 10       
##  [1,] "ukraine"       "bayern"      "krieg"       "ukraine"     "hilfe"        
##  [2,] "fraktionschef" "folgen"      "uhr"         "krieg"       "deutschland"  
##  [3,] "eu"            "brauchen"    "zeichen"     "europa"      "mehr"         
##  [4,] "gemeinsam"     "krieg"       "heute"       "mehr"        "müssen"       
##  [5,] "sei"           "helfen"      "live"        "stehen"      "kannst"       
##  [6,] "unsere"        "bund"        "berlin"      "seite"       "heute"        
##  [7,] "heute"         "hilft"       "angriff"     "tag"         "deutschen"    
##  [8,] "waffen"        "müssen"      "russischen"  "solidarität" "unterstützung"
##  [9,] "klar"          "verteilung"  "deutschland" "russland"    "folgen"       
## [10,] "krieg"         "solidarität" "frieden"     "gilt"        "bayern"
# Austria
lda_a = aus %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_a, 10)
##       Topic 1      Topic 2        Topic 3      Topic 4       Topic 5       
##  [1,] "people"     "angriff"      "angriff"    "krieg"       "solidarität" 
##  [2,] "österreich" "krieg"        "ukraine"    "sicherheit"  "ukraine"     
##  [3,] "russian"    "unsere"       "now"        "starkes"     "schon"       
##  [4,] "united"     "bevölkerung"  "österreich" "frieden"     "medizinische"
##  [5,] "danke"      "heute"        "russlands"  "wien"        "seiten"      
##  [6,] "ukraine"    "österreichs"  "seit"       "heute"       "volle"       
##  [7,] "make"       "solidarität"  "russland"   "putins"      "toy"         
##  [8,] "peace"      "gilt"         "vergessen"  "zeichen"     "bridge"      
##  [9,] "must"       "ukrainischen" "steht"      "europas"     "people"      
## [10,] "oh"         "mitgefühl"    "vielen"     "österreichs" "österreich"  
##       Topic 6     Topic 7      Topic 8     Topic 9       Topic 10    
##  [1,] "ukraine"   "russischen" "krieg"     "heute"       "österreich"
##  [2,] "geht"      "krieg"      "müssen"    "putins"      "europa"    
##  [3,] "darum"     "wien"       "crowd"     "uhr"         "unsere"    
##  [4,] "hospital"  "helfen"     "unfassbar" "solidarität" "ukraine"   
##  [5,] "stop"      "seite"      "zeiten"    "heldenplatz" "russland"  
##  [6,] "ukrainian" "millionen"  "schauen"   "stop"        "evacuation"
##  [7,] "tag"       "angriff"    "us"        "unsere"      "angriff"   
##  [8,] "bereits"   "verloren"   "jahren"    "hilfe"       "hours"     
##  [9,] "haltung"   "russland"   "russland"  "putin"       "years"     
## [10,] "gedanken"  "gibt"       "russische" "wiener"      "härtesten"
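To compare topics across the two countries in a more systematic way, one option is to measure how many top terms two topics share. The sketch below computes the Jaccard overlap for two topics whose terms are copied from the all-tweets output above (Topic 3 for Germany and Topic 1 for Austria); the helper `jaccard()` is a hypothetical addition and not part of the original analysis.

```r
# Hypothetical helper: Jaccard overlap between two sets of topic terms
# (size of the intersection divided by size of the union).
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Top terms copied from the all-tweets LDA output above
ger_topic3 <- c("krieg", "ukraine", "europa", "unsere", "russland",
                "putin", "putins", "seite", "stehen", "lage")
aus_topic1 <- c("krieg", "ukraine", "putins", "heute", "kurz",
                "europa", "müssen", "russischen", "wien", "russland")

overlap <- jaccard(ger_topic3, aus_topic1)
overlap
```

The two topics share five of fifteen distinct terms, giving an overlap of one third. Computing this for every pair of topics would yield a small matrix that shows which German and Austrian topics correspond to each other.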

By applying this unsupervised ML method we acquired certain divisions between tweets based on keywords. This

4. Results & Discussion