2. Data & Research Design

To answer our research question, we focus on tweets from a list of political parties and politicians in two countries. We obtained the data from iCandid with the permission of Prof. Leen d’Haenens. The original dataset consists of tweets and their metadata posted between 2015 and 2022 and was collected based on a list of Twitter handles of politicians and parties from four European countries (Germany, Austria, Italy and Hungary).

We chose to focus on Germany and Austria for three reasons: 1) both have important ties and connections to Russia, yet both are also EU members; 2) their national language is German, which makes certain NLP applications and analyses easier and more consistent; and 3) our team’s language capabilities and familiarity with their national contexts.

Time period: The dataset covers 1st January 2015 to 3rd April 2022. The invasion started on 24th February 2022. However, we wanted to monitor the emergence of the crisis and therefore included all tweets from 2022.

Dataset features:

* Id: Unique id of the tweet
* Type: Item type generated by iCandid. All data have the type “Message”
* Author: The original author of the tweet
* Text: The text of the tweet
* Sender: The sending account of the tweet. The values in this feature correspond to our list of accounts
* datePublished: Date of the tweet
* Url: URL address of the tweet
* Keywords: Hashtags contained in the text
* Mentions: Mentions contained in the text
* Country: The country of origin of the account

We focused on sender, text, date and hashtags in our computational analyses.

The goal of this research project is to compare two countries with certain historical, social and linguistic ties and to find similarities and differences in their online political discussions of the invasion of Ukraine. To do so, we first clean the data. We then produce descriptive statistics in the form of party- and date-based visualisations and tables. Next, we turn to the linguistic analyses, starting with wordclouds, and then to more advanced NLP applications, namely sentiment analysis and topic detection of the tweets. In all these methods we pay special attention to 1) similar trends and patterns emerging in the two countries and 2) differences in linguistic expressions and social media grammars. Moreover, we try to identify prominent political actors in relation to the Ukraine crisis in each country.

3. Analysis

Step 1: Load the required packages

library(knitr)
library(glue)
library(tidyverse)
library(readr)
library(dplyr)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)
library(tm)
library(SnowballC)
library(RCurl)
library(XML)
library(tidytext)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(udpipe)
library(spacyr)
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(topicmodels)
library(cowplot)

Step 2: Import the data

The dataset consists of two tab-delimited .csv files, one for Germany and one for Austria.

opts_knit$set(progress=FALSE, verbose=FALSE)
twitter_Germany <- read_delim("twitterDuitsePoliticiAccount.csv", 
    delim = "\t", escape_double = FALSE, 
    trim_ws = TRUE)

## Rows: 284198 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (8): id, type, author, text, sender, url, keywords, mentions
## date (1): datePublished
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View(twitter_Germany)

opts_knit$set(progress=FALSE, verbose=FALSE)
twitter_Austria <- read_delim("twitterOostenrijksePoliticiAccount.csv", 
    delim = "\t", escape_double = FALSE, 
    trim_ws = TRUE)

## Rows: 188936 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (8): id, type, author, text, sender, url, keywords, mentions
## date (1): datePublished
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View(twitter_Austria)

A look into the dataframes: Germany

head(twitter_Germany)

Austria

head(twitter_Austria)

Step 3: Filtering the time period

Select the time period of the study and the required variables.

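# Keep tweets published after 2022-01-01 (the data end on 2022-04-03).
# Note: `>` excludes tweets dated exactly 2022-01-01; use `>=` to include them.
tweets_Germany <- filter(twitter_Germany, 
                  datePublished > "2022-01-01")

tweets_Austria <- filter(twitter_Austria, 
                  datePublished > "2022-01-01")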

# select the targeted variables
tweets_Germany <- select(tweets_Germany, author, text, sender, datePublished, keywords, mentions)
tweets_Austria <- select(tweets_Austria, author, text, sender, datePublished, keywords, mentions)
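The boundary behaviour of the date filter can be checked on a toy example. The sketch below uses a made-up data frame (`toy` and `start_date` are illustrative names, not part of the dataset) and base R subsetting to show that `>` drops tweets dated exactly 1 January 2022, while `>=` keeps them.

```r
# Toy data (hypothetical): three tweets around the start of the study period
toy <- data.frame(
  text = c("a", "b", "c"),
  datePublished = as.Date(c("2021-12-31", "2022-01-01", "2022-01-02"))
)

start_date <- as.Date("2022-01-01")

# Strictly after the start date: 1 January itself is excluded
after_start <- toy[toy$datePublished > start_date, ]

# On or after the start date: 1 January is included
from_start <- toy[toy$datePublished >= start_date, ]

nrow(after_start)
nrow(from_start)
```

Which of the two is appropriate depends on whether tweets from 1 January should count as part of the study period.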

Step 4: Twitter hashtag frequency

In this part, we count, for each tweet in both countries, the hashtags that refer to Ukraine.

Germany

# Split the comma-separated hashtags of each tweet into 10 columns (X1..X10)
hashtags_Germany <- data.frame(str_split_fixed(tweets_Germany$keywords, ",", 10), datePublished = tweets_Germany$datePublished, sender = tweets_Germany$sender, mentions = tweets_Germany$mentions)
hashtags_Germany <- tibble(hashtags_Germany)

# Flag every hashtag column: TRUE if the hashtag contains "UKRAIN" (case-insensitive)
hashtag_cols <- paste0("X", 1:10)
hashtags_Germany[hashtag_cols] <- lapply(hashtags_Germany[hashtag_cols],
  function(x) str_detect(toupper(str_squish(x)), "UKRAIN.*"))

# Number of Ukraine-related hashtags per tweet
hashtags_Germany$Count_g <- rowSums(hashtags_Germany[hashtag_cols])

# Store everything, together with the tweet text, in a new object that is used later
# to extract the Ukraine-related tweets (the object name keeps its original spelling).
Ukriane_tweets_g <- cbind(hashtags_Germany, text = tweets_Germany$text)

Austria

# Split the comma-separated hashtags of each tweet into 10 columns (X1..X10)
hashtags_Austria <- data.frame(str_split_fixed(tweets_Austria$keywords, ",", 10), datePublished = tweets_Austria$datePublished, sender = tweets_Austria$sender, mentions = tweets_Austria$mentions)
hashtags_Austria <- tibble(hashtags_Austria)

# Flag every hashtag column: TRUE if the hashtag contains "UKRAIN" (case-insensitive)
hashtag_cols <- paste0("X", 1:10)
hashtags_Austria[hashtag_cols] <- lapply(hashtags_Austria[hashtag_cols],
  function(x) str_detect(toupper(str_squish(x)), "UKRAIN.*"))

# Number of Ukraine-related hashtags per tweet
hashtags_Austria$Count_a <- rowSums(hashtags_Austria[hashtag_cols])

# Store everything, together with the tweet text, in a new object that is used later
# to extract the Ukraine-related tweets (the object name keeps its original spelling).
Ukriane_tweets_a <- cbind(hashtags_Austria, text = tweets_Austria$text)
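The per-tweet count can also be computed directly from the keyword string, without helper columns. The following is a minimal base R sketch on made-up keyword strings (the vector `keywords` and the helper `count_ukraine` are illustrative and not part of the notebook). It splits each string on commas and counts the pieces that contain "UKRAIN"; for non-missing keyword strings with at most ten hashtags this should agree with `Count_g` and `Count_a`.

```r
# Toy keyword strings (hypothetical), in the same comma-separated form as `keywords`
keywords <- c("Ukraine,Putin", "Corona", "StandWithUkraine, ukraine ,EU", "")

# Count the hashtags in one keyword string that contain "UKRAIN" (case-insensitive)
count_ukraine <- function(x) {
  tags <- toupper(trimws(strsplit(x, ",", fixed = TRUE)[[1]]))
  sum(grepl("UKRAIN", tags, fixed = TRUE))
}

# Apply the helper to every tweet
counts <- vapply(keywords, count_ukraine, integer(1), USE.NAMES = FALSE)
counts
```

Unlike the fixed ten-column split, this variant treats every hashtag separately regardless of how many a tweet carries.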

The tweet counts by account reported below show that party accounts in Germany tweet more than individual politicians, whereas in Austria individual politicians tweet more than the parties.

NOTE: Would it not be better to 1) limit these tables to the top 10 (they are too long) and 2) present them as bar charts?

NOTE 2: Could we also produce the same bar charts or tables for tweets that carry a Ukraine hashtag or contain "ukraine"? We could then normalize them and see, for example in percentages, which party or political leader devoted a larger share of tweets to Ukraine. That would make for a great discussion.
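The normalization suggested in NOTE 2 could look roughly as follows. This is a sketch on made-up data, assuming one row per tweet with a `sender` column and a per-tweet Ukraine hashtag count (here called `Count`, standing in for `Count_g` or `Count_a`); the names `toy` and `ukraine_share` are illustrative.

```r
library(dplyr)

# Toy data (hypothetical): one row per tweet
toy <- tibble(
  sender = c("Party A", "Party A", "Party A", "Politician B", "Politician B"),
  Count  = c(1, 0, 0, 2, 1)
)

# Share of each sender's tweets that carry at least one Ukraine hashtag
ukraine_share <- toy %>%
  group_by(sender) %>%
  summarise(
    n_tweets  = n(),
    n_ukraine = sum(Count > 0),
    share_pct = 100 * n_ukraine / n_tweets
  ) %>%
  arrange(desc(share_pct))

ukraine_share
```

Percentages based on very few tweets are unstable, so a minimum number of tweets per sender may be worth requiring before ranking accounts this way.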

Step 5: Tweet count by account

It is informative to see who tweeted the most during the relevant period. To see this, we construct a simple frequency list of tweets by the account that sent them.

Germany

# All tweets
## Germany
Sender_freq_g <- str_squish(unlist(na.omit(toupper(str_squish(tweets_Germany$sender)))))
Senders <- data.frame(sort(table(Sender_freq_g), decreasing=TRUE))
df <- data.frame(Sender = Senders$Sender_freq_g, Freq=Senders$Freq)
f1 <- df[1:10,]

## Austria
Sender_freq_a <- str_squish(unlist(na.omit(toupper(str_squish(tweets_Austria$sender)))))
Senders <- data.frame(sort(table(Sender_freq_a), decreasing=TRUE))

df <- data.frame(Sender = Senders$Sender_freq_a, Freq=Senders$Freq)
f2 <- df[1:10,]


# Ukraine tweets
## Germany
Sender_freq_g <- Ukriane_tweets_g[Ukriane_tweets_g$Count_g > 0,]
Sender_freq_g <- str_squish(unlist(na.omit(toupper(str_squish(Sender_freq_g$sender)))))
Senders <- data.frame(sort(table(Sender_freq_g), decreasing=TRUE))
df <- data.frame(Sender = Senders$Sender_freq_g, Freq=Senders$Freq)
f3 <- df[1:10,]

## Austria
Sender_freq_a <- Ukriane_tweets_a[Ukriane_tweets_a$Count_a > 0,]
Sender_freq_a <- str_squish(unlist(na.omit(toupper(str_squish(Sender_freq_a$sender)))))
Senders <- data.frame(sort(table(Sender_freq_a), decreasing=TRUE))
df <- data.frame(Sender = Senders$Sender_freq_a, Freq=Senders$Freq)
f4 <- df[1:10, ]

f_all <- data.frame(f1,f2,f3,f4)
names(f_all) <- c("Senders_1", "Freq_1", "Sender_2", "Freq_2", "Senders_3", "Freq_3", "Senders_4", "Freq_4")
kable(f_all, col.names = c("All tweets senders (Germany)","Number of tweets","All tweets senders (Austria)","Number of tweets","Ukraine tweets senders (Germany)","Number of tweets","Ukraine tweets senders (Austria)","Number of tweets"))

All tweets senders (Germany)	Number of tweets	All tweets senders (Austria)	Number of tweets	Ukraine tweets senders (Germany)	Number of tweets	Ukraine tweets senders (Austria)	Number of tweets
SPD PARTEIVORSTAND 🇪🇺	1697	RUDI ANSCHOBER	2277	SPD PARTEIVORSTAND 🇪🇺	129	RUDI ANSCHOBER	98
CDU/CSU	1156	PETER PILZ	1353	CDU/CSU	105	PETER PILZ	17
FDP	1008	DAS NEUE ÖSTERREICH	560	CEM ÖZDEMIR	64	DAS NEUE ÖSTERREICH	12
JOANA COTAR	665	BEATE MEINL-REISINGER	522	FDP	59	DIE GRÜNEN	12
CSU	623	SPÖ	496	CDU DEUTSCHLANDS	48	HAGEN REINHOLD, MDB	11
CEM ÖZDEMIR	587	FPÖ	282	CSU	42	WERNER KOGLER	10
CDU DEUTSCHLANDS	540	MATTHIAS STROLZ	276	MARKUS SÖDER	34	SPÖ	9
DIE LINKE	520	DIE GRÜNEN	236	DIE LINKE	32	MATTHIAS STROLZ	7
ALTERNATIVE FÜR 🇩🇪 DEUTSCHLAND	449	WERNER KOGLER	199	CHRISTIAN LINDNER	29	BEATE MEINL-REISINGER	6
MARKUS SÖDER	407	HAGEN REINHOLD, MDB	143	SAHRA WAGENKNECHT	26	PAMELA RENDI-WAGNER	5

Step 6: Ukraine-hashtag Timeline

In this part, we plot the sum of Ukraine-related hashtags over the given period. NOTE: We should do this at least on a weekly, if not daily, basis. Then we can read the plot better and say more about it. A monthly division is too broad.

# Total number of Ukraine-related hashtags per day
Agg_hashtags_g <- aggregate(Count_g ~ datePublished, data = hashtags_Germany, sum)
Agg_hashtags_a <- aggregate(Count_a ~ datePublished, data = hashtags_Austria, sum)
# Use the full column names (Count_g, Count_a) rather than relying on partial matching of `$Count`
plot(Agg_hashtags_g$datePublished, Agg_hashtags_g$Count_g, type = "l", xlab = "Date", ylab = "Number of Ukraine hashtags")
lines(Agg_hashtags_a$datePublished, Agg_hashtags_a$Count_a, col = "red", type = "l")
legend("topleft", legend = c("Germany", "Austria"),
       col = c("black", "red"), lty = 1, cex = 0.8)

There were some initial political reactions from Germany before the invasion, while Austrian actors remained mostly silent on the issue. Unsurprisingly, the number of tweets escalated dramatically in both countries once the invasion began. There are more Ukraine-hashtagged tweets from German actors; however, we follow more accounts there and therefore have more data, so this appearance may simply reflect the quantity of data. Finally, German actors continue to tweet more about Ukraine, while the debate in Austria subsides.
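If a weekly rather than daily timeline is wanted, the counts can be binned by calendar week before plotting. The sketch below assumes a data frame with a `datePublished` date column and a per-tweet hashtag count (called `Count` here); the data are made up, and starting weeks on Monday (`week_start = 1`) is an assumption.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical): Ukraine hashtag counts on four days
toy <- tibble(
  datePublished = as.Date(c("2022-02-21", "2022-02-24", "2022-02-28", "2022-03-02")),
  Count = c(1, 5, 3, 2)
)

# Assign each date to the Monday that starts its week, then sum within weeks
weekly <- toy %>%
  mutate(week = floor_date(datePublished, unit = "week", week_start = 1)) %>%
  group_by(week) %>%
  summarise(n_hashtags = sum(Count))

weekly
```

Dividing such weekly counts by the total number of tweets in the same week would additionally account for the different amounts of data available for the two countries.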

Step 7: Wordclouds

There are two relevant ways of generating wordclouds in our data: 1) Based on the hashtags and 2) based on the tweet texts. We will conduct both starting with the hashtags.

Hashtags Wordclouds

In this part, we will generate wordclouds to see the main hashtags in our target time period.

# Germany
hashtags_freq_g <- str_squish(unlist(str_split(na.omit(toupper(str_squish(tweets_Germany$keywords))), ",")))
docs <- Corpus(VectorSource(hashtags_freq_g))
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

kable(df[1:30, ], caption = "Germany")

Germany
	word	freq
ukraine	ukraine	624
afd	afd	303
bundestag	bundestag	205
impfpflicht	impfpflicht	194
corona	corona	188
ampel	ampel	166
russland	russland	134
putin	putin	126
bundesversammlung	bundesversammlung	118
bundesregierung	bundesregierung	107
teamcdu	teamcdu	104
3k22	3k22	92
cdupt22	cdupt22	91
steinmeier	steinmeier	85
aufinsneue	aufinsneue	82
cdu	cdu	81
deutschland	deutschland	80
bundespräsident	bundespräsident	78
saarland	saarland	78
europa	europa	65
dbdk22	dbdk22	64
freiheit	freiheit	63
bundeswehr	bundeswehr	60
omikron	omikron	59
spd	spd	57
habeck	habeck	55
inflation	inflation	54
bayern	bayern	51
scholz	scholz	51
standwithukraine	standwithukraine	48

df1 <- df[1:15,]

The wordcloud of hashtags from Germany and its table representation show that the AfD (Alternative für Deutschland) is prominently hashtagged. There are also tweets related to national celebrations. Ukraine is the most frequent hashtag, while Russia and Putin are also addressed.

# Austria
hashtags_freq_a <- str_squish(unlist(str_split(na.omit(toupper(str_squish(tweets_Austria$keywords))), ",")))
docs <- Corpus(VectorSource(hashtags_freq_a))
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

kable(df[1:30, ], caption = "Austria")

Austria
	word	freq
ukraine	ukraine	147
bmichats	bmichats	114
övp	övp	91
sobotka	sobotka	79
oevpua	oevpua	58
zib2	zib2	44
oenr	oenr	44
rotesfoyer	rotesfoyer	40
oevpkorruptionsua	oevpkorruptionsua	37
putin	putin	36
longcovid	longcovid	31
oevp	oevp	28
russland	russland	27
breaking	breaking	23
covid19	covid19	22
omikron	omikron	22
standwithukraine	standwithukraine	20
kloibmüller	kloibmüller	20
imzentrum	imzentrum	19
wksta	wksta	19
einland	einland	17
hessenthaler	hessenthaler	17
yeswecare	yeswecare	16
covid	covid	16
covid19at	covid19at	15
neutralität	neutralität	15
sideletter	sideletter	15
nehammer	nehammer	13
weremember	weremember	13
wolf	wolf	12

df2 <- df[1:15,]
par(mfrow = c(1,2))
barplot(df1$freq, names.arg = df1$word, las=2, col = 2, ylab = "Word frequency", xlab = "Hashtags", main = "Germany")

barplot(df2$freq, names.arg = df2$word, las=2, col = 3, ylab = "Word frequency", xlab = "Hashtags", main = "Austria")

NOTE: The x-axis label collides with the long hashtags on my screen; can we disable the label on the x-axis, please? Also, there are more bars than there are hashtag labels on the x-axis, so the plot is not really helpful right now.
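One possible way to address the note is to drop the x-axis title, enlarge the bottom margin and shrink the bar labels. The sketch below uses four hashtag counts taken from the Austria table above; the margin and `cex.names` values are guesses that may need tuning, and `pdf(NULL)` is only there so that the example also runs without a screen device.

```r
# A few hashtag frequencies from the Austria table
hashtag_freq <- c(ukraine = 147, bmichats = 114,
                  oevpkorruptionsua = 37, standwithukraine = 20)

pdf(NULL)                             # null device: nothing is written to disk
old_par <- par(mar = c(10, 4, 2, 1))  # larger bottom margin for long vertical labels

bar_mid <- barplot(
  hashtag_freq,
  names.arg = names(hashtag_freq),
  las = 2,          # vertical bar labels
  cex.names = 0.8,  # smaller labels, so fewer may be dropped for lack of space
  xlab = "",        # no x-axis title, so it cannot collide with the labels
  ylab = "Word frequency",
  main = "Austria"
)

par(old_par)        # restore the previous graphical parameters
dev.off()
```

`barplot()` returns the bar midpoints, which can be used to confirm that one bar was drawn per hashtag.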

Ukraine is also the leading hashtag in Austria for the given time period. However, unlike in Germany, it is followed by many hashtags on national debates. Putin is only the 10th most frequent hashtag.

The code below generates four graphs in one frame in order to show the relative frequency of all tweets versus Ukraine-related ones.

# All tweets

# Germany
tweets_per_week <- tweets_Germany %>%
  mutate(week = round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n = n())
g1 <- ggplot(tweets_per_week, aes(x = week, y = n)) +
  geom_col(fill = "yellow2") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("All Tweets - Germany")

# Austria
tweets_per_week <- tweets_Austria %>%
  mutate(week = round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n = n())
g2 <- ggplot(tweets_per_week, aes(x = week, y = n)) +
  geom_col(fill = "red3") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("All Tweets - Austria")

# Ukraine tweets: keep tweets with at least one Ukraine-related hashtag
dfdu <- Ukriane_tweets_g[Ukriane_tweets_g$Count_g > 0,]
dfau <- Ukriane_tweets_a[Ukriane_tweets_a$Count_a > 0,]

# Germany
tweets_per_week <- dfdu %>%
  mutate(week = round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n = n())
g3 <- ggplot(tweets_per_week, aes(x = week, y = n)) +
  geom_col(fill = "khaki") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("Ukraine Tweets - Germany")

# Austria
tweets_per_week <- dfau %>%
  mutate(week = round_date(datePublished, "week")) %>%
  group_by(week) %>% summarize(n = n())
g4 <- ggplot(tweets_per_week, aes(x = week, y = n)) +
  geom_col(fill = "indianred1") + theme_classic() +
  xlab("week") + ylab("# of tweets") +
  ggtitle("Ukraine Tweets - Austria")

# Combine the four plots: tweets over time, all versus Ukraine-related
plot_grid(g1, g2, g3, g4, labels = "AUTO")

This four-panel chart demonstrates that political communication in Austria intensified with the invasion: parties and politicians sent more tweets than during the relatively quiet earlier weeks of 2022. Apart from this, the sharp rise in the number of tweets at the invasion and the subsequent decline follow a similar pattern.

Step 8: Sentiment Analysis

For sentiment analysis of the tweet contents we employed the “Syuzhet” package, which is described as “An R package for the extraction of sentiment and sentiment-based plot arcs from text.”

NOTE: I think we run the sentiment analysis here on all tweets. Why? As it stands, this is unrelated to the research question. We can still keep it, but then we should also run a sentiment analysis on the tweets about Ukraine and compare how the sentiment differs.
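The comparison proposed in the note can be sketched as a difference in mean sentiment between Ukraine-related tweets and all other tweets. The numbers below are made up and only stand in for per-tweet scores such as those returned by `get_sentiment()`; `scores`, `is_ukraine` and the result names are illustrative.

```r
# Hypothetical per-tweet sentiment scores and Ukraine flags
scores     <- c(0.5, -1.0, 0.25, -0.75, 1.0, -0.5)
is_ukraine <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)

# Mean sentiment within each group (names of the result are "FALSE" and "TRUE")
mean_by_group <- tapply(scores, is_ukraine, mean)

# Negative values mean that Ukraine tweets are, on average, more negative
diff_in_means <- mean_by_group[["TRUE"]] - mean_by_group[["FALSE"]]

mean_by_group
diff_in_means
```

In the notebook, the flag could be derived from `Count_g > 0` or `Count_a > 0`, provided the scores are computed on the same rows and in the same order.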

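# Sentiment analysis for all tweets
# Germany
# Note: despite its name, Ukriane_tweets_g holds all 2022 tweets, not only the Ukraine-related ones
Text_all_g <- Ukriane_tweets_g$text

# Clean the text
Text_all_g <- gsub("#\\S*", "", Text_all_g)        # remove hashtags
Text_all_g <- gsub("https\\S*", "", Text_all_g)    # remove URLs starting with https
Text_all_g <- gsub("@\\S*", "", Text_all_g)        # remove mentions
Text_all_g <- gsub("amp", "", Text_all_g)          # remove "&amp;" residue (also strips "amp" inside words, e.g. "Kampf")
Text_all_g <- gsub("[\r\n]", "", Text_all_g)       # remove line breaks
Text_all_g <- gsub("[[:punct:]]", "", Text_all_g)  # remove punctuation
Text_all_g <- gsub("\\d", "", Text_all_g)          # remove digits
Text_all_g <- na.omit(toupper(str_squish(Text_all_g)))  # squeeze whitespace, upper-case, drop NAs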

ger_all = corpus(Text_all_g) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(ger_all, max_words=200)

words <- sort(colSums(ger_all), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 300, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Germany")

# Sentiment Analysis
tg <- iconv(Text_all_g)
s1 <- get_nrc_sentiment(tg, language = "german")

## Warning: `spread_()` was deprecated in tidyr 1.2.0.
## Please use `spread()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

barplot(colSums(s1),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Germany - Sentiment Scores Tweets')

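# Note: according to the syuzhet documentation, the `language` argument only takes effect
# with method = "nrc", so these scores should be read with caution for German text.
values_g <- get_sentiment(Text_all_g, method = "syuzhet", language = "german")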
simple_plot(values_g)

#Austria
Text_all_a <- Ukriane_tweets_a$text
Text_all_a <- gsub("#\\S*", "", Text_all_a)
Text_all_a <- gsub("https\\S*", "", Text_all_a) 
Text_all_a <- gsub("@\\S*", "", Text_all_a)
Text_all_a <- gsub("amp", "", Text_all_a)  # remove "&amp;" residue (also strips "amp" inside words, e.g. "Kampf")
Text_all_a <- gsub("[\r\n]", "", Text_all_a)
Text_all_a <- gsub("[[:punct:]]", "", Text_all_a)
Text_all_a <- gsub("\\d", "", Text_all_a)
Text_all_a <- na.omit(toupper(str_squish(Text_all_a)))


aus_all = corpus(Text_all_a) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(aus_all, max_words=200)

words <- sort(colSums(aus_all), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 200, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Austria")

# Sentiment Analysis
ta <- iconv(Text_all_a)
s2 <- get_nrc_sentiment(ta, language = "german")
barplot(colSums(s2),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Austria - Sentiment Scores Tweets')

values_a <- get_sentiment(Text_all_a, method = "syuzhet", language = "german")
simple_plot(values_a)

# Sentiment analysis for the tweets that include #Ukraine
# Germany 
#Create a vector containing only the text
Text_g <- Ukriane_tweets_g[Ukriane_tweets_g$Count_g > 0,] #selecting the tweets that include #Ukraine
Text_g <- Text_g$text

# clean the text
Text_g <- gsub("#\\S*", "", Text_g)
Text_g <- gsub("https\\S*", "", Text_g) 
Text_g <- gsub("@\\S*", "", Text_g)
Text_g <- gsub("amp", "", Text_g)  # remove "&amp;" residue (also strips "amp" inside words, e.g. "Kampf")
Text_g <- gsub("[\r\n]", "", Text_g)
Text_g <- gsub("[[:punct:]]", "", Text_g)
Text_g <- gsub("\\d", "", Text_g)
Text_g <- na.omit(toupper(str_squish(Text_g)))

ger = corpus(Text_g) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(ger, max_words=200)

words <- sort(colSums(ger), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 30, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Germany")

tg <- iconv(Text_g)
s1 <- get_nrc_sentiment(tg, language = "german")
barplot(colSums(s1),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Germany - Sentiment Scores Tweets')

values_g <- get_sentiment(Text_g, method = "syuzhet", language = "german")
simple_plot(values_g)

#Austria
Text_a <- Ukriane_tweets_a[Ukriane_tweets_a$Count_a > 0,]
Text_a <- Text_a$text

# clean the text
Text_a <- gsub("#\\S*", "", Text_a)
Text_a <- gsub("https\\S*", "", Text_a) 
Text_a <- gsub("@\\S*", "", Text_a)
Text_a <- gsub("amp", "", Text_a)  # remove "&amp;" residue (also strips "amp" inside words, e.g. "Kampf")
Text_a <- gsub("[\r\n]", "", Text_a)
Text_a <- gsub("[[:punct:]]", "", Text_a)
Text_a <- gsub("\\d", "", Text_a)
Text_a <- na.omit(toupper(str_squish(Text_a)))

aus = corpus(Text_a) %>% 
  tokens(remove_punct=T) %>% 
  dfm() %>%
  dfm_remove(stopwords("german")) %>%
  dfm_remove(stopwords("english")) %>%
  dfm_remove(c("dass", "menschen"))
textplot_wordcloud(aus, max_words=200)

words <- sort(colSums(aus), decreasing = T)
df <- data.frame(word = names(words), freq=words)
df <- df[df$freq > 10, ]
barplot(df$freq, names.arg = df$word, las=2, col = 2, main = "Austria")

ta <- iconv(Text_a)
s2 <- get_nrc_sentiment(ta, language = "german") # counts of emotion terms and of positive and negative terms
barplot(colSums(s2),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Austria - Sentiment Scores Tweets')

values_a <- get_sentiment(Text_a, method = "syuzhet", language = "german")
simple_plot(values_a)

Step 9: Topic Detection

To detect topics in the tweet contents we used Latent Dirichlet Allocation (LDA). The algorithm produced the following topic categories and the keywords associated with them.

NOTE: Can we not get rid of more stop words here as well? There are so many adverbs and other uninformative words in the result that it is hard to interpret.
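The pipelines above already drop "dass" and "menschen" in addition to the standard stop word lists, and the same idea extends to other frequent but uninformative terms. The sketch below applies it to three made-up sentences with quanteda; the extra stop words are picked by hand from the topic lists purely for illustration, and `toy_text`, `extra_stopwords` and `toy_dfm` are hypothetical names.

```r
library(quanteda)

# Toy documents (hypothetical)
toy_text <- c(
  "heute mehr krieg in der ukraine",
  "heute müssen wir mehr helfen",
  "putin und der krieg in europa"
)

# Hand-picked extra stop words (an assumption; extend after inspecting the topic terms)
extra_stopwords <- c("heute", "mehr", "müssen")

# Tokenize, remove the standard German list plus the extra terms, then build the dfm
toy_tokens <- tokens(toy_text, remove_punct = TRUE)
toy_tokens <- tokens_remove(toy_tokens, c(stopwords("german"), extra_stopwords))
toy_dfm <- dfm(toy_tokens)

featnames(toy_dfm)
```

The resulting document-feature matrix can be passed to `convert(to = "topicmodels")` in the same way as the existing ones.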

# LDA for all tweets
# Germany
lda_all_g = ger_all %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_all_g, 10)

##       Topic 1         Topic 2        Topic 3    Topic 4             Topic 5    
##  [1,] "müssen"        "heute"        "krieg"    "presseinfo"        "bm"       
##  [2,] "deutschland"   "uhr"          "ukraine"  "mehr"              "sagt"     
##  [3,] "dafür"         "unserer"      "europa"   "neue"              "wäre"     
##  [4,] "wer"           "live"         "unsere"   "gibt"              "ende"     
##  [5,] "unsere"        "unsere"       "russland" "brauchen"          "heute"    
##  [6,] "gut"           "ab"           "putin"    "weniger"           "u"        
##  [7,] "mehr"          "freiheit"     "putins"   "ministerpräsident" "darüber"  
##  [8,] "verantwortung" "opfer"        "seite"    "statt"             "fragen"   
##  [9,] "erste"         "morgen"       "stehen"   "müssen"            "bundestag"
## [10,] "recht"         "gesellschaft" "lage"     "braucht"           "deutsche" 
##       Topic 6           Topic 7       Topic 8       Topic 9          
##  [1,] "interview"       "gute"        "beim"        "mehr"           
##  [2,] "sei"             "glückwunsch" "brauchen"    "euro"           
##  [3,] "endlich"         "herzlichen"  "bayern"      "brauchen"       
##  [4,] "bürger"          "danke"       "müssen"      "müssen"         
##  [5,] "müssen"          "lieber"      "setzen"      "mrd"            
##  [6,] "bundesregierung" "frankwalter" "heute"       "schnell"        
##  [7,] "heute"           "erfolg"      "energien"    "bürger"         
##  [8,] "geht"            "freue"       "deutschland" "bundesregierung"
##  [9,] "völlig"          "arbeit"      "cl"          "macht"          
## [10,] "schritt"         "dank"        "zukunft"     "milliarden"     
##       Topic 10      
##  [1,] "frauen"      
##  [2,] "mehr"        
##  [3,] "saarland"    
##  [4,] "minister"    
##  [5,] "kinder"      
##  [6,] "tage"        
##  [7,] "landtagswahl"
##  [8,] "heute"       
##  [9,] "themen"      
## [10,] "erfahren"

# Austria
lda_all_a = aus_all %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_all_a, 10)

##       Topic 1      Topic 2     Topic 3       Topic 4     Topic 5      
##  [1,] "krieg"      "russia"    "österreich"  "ukraine"   "österreich" 
##  [2,] "ukraine"    "ukraine"   "geht"        "russian"   "russische"  
##  [3,] "putins"     "russian"   "övp"         "ukrainian" "angriff"    
##  [4,] "heute"      "kyiv"      "wurde"       "city"      "russland"   
##  [5,] "kurz"       "people"    "wksta"       "putin"     "unsere"     
##  [6,] "europa"     "new"       "steht"       "breaking"  "ukraine"    
##  [7,] "müssen"     "now"       "nehammer"    "says"      "danke"      
##  [8,] "russischen" "us"        "regierung"   "said"      "putin"      
##  [9,] "wien"       "today"     "macht"       "minister"  "solidarität"
## [10,] "russland"   "ukrainian" "neutralität" "russias"   "sanktionen" 
##       Topic 6      Topic 7       Topic 8  Topic 9    Topic 10   
##  [1,] "mehr"       "impfpflicht" "mehr"   "gute"     "heute"    
##  [2,] "regierung"  "immer"       "immer"  "heute"    "geht"     
##  [3,] "seit"       "mehr"        "seit"   "viele"    "sobotka"  
##  [4,] "österreich" "regierung"   "wer"    "tag"      "gemeinsam"
##  [5,] "geht"       "österreich"  "schon"  "pandemie" "gast"     
##  [6,] "endlich"    "gibt"        "ja"     "bitte"    "los"      
##  [7,] "jahren"     "heute"       "övp"    "omikron"  "gibt"     
##  [8,] "europa"     "övp"         "geht"   "mehr"     "övp"      
##  [9,] "heute"      "schon"       "müssen" "wurde"    "zackzack" 
## [10,] "wenig"      "fpö"         "jahren" "schon"    "erfolg"

# LDA for the tweets that include #Ukraine
# Germany
lda_g = ger %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_g, 10)

##       Topic 1       Topic 2      Topic 3         Topic 4        Topic 5        
##  [1,] "krieg"       "angriff"    "unseren"       "freiheit"     "unsere"       
##  [2,] "heute"       "presseinfo" "partnern"      "unsere"       "krieg"        
##  [3,] "ukraine"     "krieg"      "putins"        "demokratie"   "putins"       
##  [4,] "mehr"        "land"       "angriffskrieg" "deutschland"  "heute"        
##  [5,] "kiew"        "russland"   "gemeinsam"     "ukrainischen" "uhr"          
##  [6,] "unsere"      "putin"      "stehen"        "frieden"      "lage"         
##  [7,] "gast"        "treffen"    "seite"         "präsident"    "thema"        
##  [8,] "verurteilen" "russischen" "leid"          "helfen"       "gespräch"     
##  [9,] "schärfste"   "sofort"     "angriff"       "russische"    "danke"        
## [10,] "seit"        "seite"      "eu"            "seite"        "angriffskrieg"
##       Topic 6         Topic 7       Topic 8       Topic 9       Topic 10       
##  [1,] "ukraine"       "bayern"      "krieg"       "ukraine"     "hilfe"        
##  [2,] "fraktionschef" "folgen"      "uhr"         "krieg"       "deutschland"  
##  [3,] "eu"            "brauchen"    "zeichen"     "europa"      "mehr"         
##  [4,] "gemeinsam"     "krieg"       "heute"       "mehr"        "müssen"       
##  [5,] "sei"           "helfen"      "live"        "stehen"      "kannst"       
##  [6,] "unsere"        "bund"        "berlin"      "seite"       "heute"        
##  [7,] "heute"         "hilft"       "angriff"     "tag"         "deutschen"    
##  [8,] "waffen"        "müssen"      "russischen"  "solidarität" "unterstützung"
##  [9,] "klar"          "verteilung"  "deutschland" "russland"    "folgen"       
## [10,] "krieg"         "solidarität" "frieden"     "gilt"        "bayern"

# Austria
lda_a = aus %>% 
  convert(to = "topicmodels") %>%
  LDA(k=10,control=list(seed=123, alpha = 1/1:10))
terms(lda_a, 10)

##       Topic 1      Topic 2        Topic 3      Topic 4       Topic 5       
##  [1,] "people"     "angriff"      "angriff"    "krieg"       "solidarität" 
##  [2,] "österreich" "krieg"        "ukraine"    "sicherheit"  "ukraine"     
##  [3,] "russian"    "unsere"       "now"        "starkes"     "schon"       
##  [4,] "united"     "bevölkerung"  "österreich" "frieden"     "medizinische"
##  [5,] "danke"      "heute"        "russlands"  "wien"        "seiten"      
##  [6,] "ukraine"    "österreichs"  "seit"       "heute"       "volle"       
##  [7,] "make"       "solidarität"  "russland"   "putins"      "toy"         
##  [8,] "peace"      "gilt"         "vergessen"  "zeichen"     "bridge"      
##  [9,] "must"       "ukrainischen" "steht"      "europas"     "people"      
## [10,] "oh"         "mitgefühl"    "vielen"     "österreichs" "österreich"  
##       Topic 6     Topic 7      Topic 8     Topic 9       Topic 10    
##  [1,] "ukraine"   "russischen" "krieg"     "heute"       "österreich"
##  [2,] "geht"      "krieg"      "müssen"    "putins"      "europa"    
##  [3,] "darum"     "wien"       "crowd"     "uhr"         "unsere"    
##  [4,] "hospital"  "helfen"     "unfassbar" "solidarität" "ukraine"   
##  [5,] "stop"      "seite"      "zeiten"    "heldenplatz" "russland"  
##  [6,] "ukrainian" "millionen"  "schauen"   "stop"        "evacuation"
##  [7,] "tag"       "angriff"    "us"        "unsere"      "angriff"   
##  [8,] "bereits"   "verloren"   "jahren"    "hilfe"       "hours"     
##  [9,] "haltung"   "russland"   "russland"  "putin"       "years"     
## [10,] "gedanken"  "gibt"       "russische" "wiener"      "härtesten"

By applying this unsupervised ML method we acquired certain divisions between tweets based on keywords. This

Collecting & analyzing Big Data for Social Sciences


Assignment 2: Research Notebook

Group Members: Sercan Kıyak (0650472), Hatem Alharazin (0702702), Mathias Vasilev (0884889)
