This analysis was carried out within the framework of UMEVT (Understanding and Mitigating Electoral Violence in Turkey). UMEVT is funded by the British Academy under the Newton Advanced Fellowship Scheme.
The construction of the Database of Incidents of Electoral Violence in Turkey (DIEV-T) was carried out by researchers based at Atilim University and King’s College London. The project leaders were Emre Toros (Atilim University) and Sarah Birch (King’s College London). All data from the UMEVT project will be made publicly available after the project is completed.
This is an early attempt to understand the text of electoral integrity and electoral violence news that appeared in Turkish newspapers around a number of elections, using several tools, including Latent Dirichlet Allocation (LDA), a generative statistical procedure that allows sets of observations to be explained by unobserved groups that account for why some parts of the data are similar.
This piece is limited to the presentation of the analysis and does not include any comments or conclusions. For the technical details of the tools used, please refer to “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson (https://www.tidytextmining.com/).
DIEV-T is designed to measure electoral violence in Turkey in parliamentary, local and presidential elections that have taken place since 1950. DIEV-T includes content analysis of newspaper reports for each election, covering the period from at least 15 days before to 15 days after each election day. All documented incidents of electoral violence during each one-month period are identified. The following aspects of each incident of electoral violence are coded: time, place, type of incident, and characteristics of the perpetrator(s) and victim(s) of violence, including number, partisan affiliation, state affiliation and gender.
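For illustration, each coded incident can be thought of as one row in a table along the following lines (a minimal sketch with hypothetical column names and placeholder values; these are not the actual DIEV-T variable names or data):
library(tibble)
# hypothetical illustration of the coding scheme; not actual DIEV-T variables or data
incident_example <- tibble(
incident_date = as.Date("2015-06-01"),  # time of incident (placeholder)
province      = "Province A",           # place (placeholder)
incident_type = "attack on rally",      # type of incident (placeholder)
perpetrator   = "Party B supporters",   # characteristics of perpetrator(s) (placeholder)
victim_party  = "Party C",              # partisan affiliation of victim(s) (placeholder)
n_victims     = 2L,                     # number of victims (placeholder)
victim_gender = "mixed")                # gender of victim(s) (placeholder)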
The content analysis was carried out on a sample of national news media coverage for each election. Newspapers were used because they are consistently listed among the most important sources of information for Turkish citizens. DIEV-T was extracted from digital copies of the newspapers and produced according to a set coding procedure.
In choosing newspapers, we aimed to include titles with broad coverage as well as ideological diversity; the newspapers were therefore selected to provide a comprehensive picture of news coverage in Turkey.
The data used in this particular analysis were collected from various newspapers during the elections of 2009, 2011, 2014, 2015 and 2018; when the database is completed, it will extend back to 1950. Although the data include a number of other variables, the variable used in this study is the headline text of news items on electoral integrity and violence.
The following analysis is inspired by “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson (https://www.tidytextmining.com/). As an overarching theme, the newspapers are grouped as pro-government, anti-government and middle-of-the-road newspapers.
DIEV.T <- read.csv("C:/Users/user/OneDrive/Makale/ElectoralIntegrity/DIEVT/DIEV-T.csv", sep=";")
data <- DIEV.T
# creating pro/anti gov newspaper variable
library(dplyr)
data <- data %>%
  mutate(np_ide = case_when(
    Newspaper == 'Akit' ~ "pro",
    Newspaper == 'Haberturk' ~ "pro",
    Newspaper == 'Sabah' ~ "pro",
    Newspaper == 'Aydınlık' ~ "anti",
    Newspaper == 'Cumhuriyet' ~ "anti",
    Newspaper == 'Evrensel' ~ "anti",
    Newspaper == 'Ortadogu' ~ "anti",
    Newspaper == 'Radikal' ~ "anti",
    Newspaper == 'Sözcü' ~ "anti",
    TRUE ~ "middle"))
The following function converts Turkish characters to their plain ASCII equivalents. R frequently runs into encoding problems with Turkish text, and a substitution of this kind is a practical way of avoiding them.
to.plain <- function(s) {
  # single-character substitutions (Turkish letters to ASCII equivalents)
  old1 <- "çğşıüöÇĞŞIÜÖ"
  new1 <- "cgsiuocgsiuo"
  s1 <- chartr(old1, new1, s)
  # multi-character substitutions
  old2 <- c("œ", "ß", "æ", "ø")
  new2 <- c("oe", "ss", "ae", "oe")
  s2 <- s1
  for (i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)
  s2
}
# usage: data$variable <- to.plain(data$variable)
data$Headline <- to.plain(data$Headline)
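For example, an illustrative call (not taken from the project data) behaves as follows:
# illustrative conversion of a Turkish string to plain ASCII
to.plain("Seçim bürosuna saldırı")   # returns "Secim burosuna saldiri"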
To understand a term’s importance, its inverse document frequency (idf) can be calculated, which decreases the weight of commonly used words and increases the weight of words that are not used very much in a collection of documents.
This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites. (Silge & Robinson, 2018)
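As a rough numerical illustration of these quantities (a minimal sketch with made-up counts, not the project data): suppose a word occurs 10 times among 1,000 words in one of the three newspaper groups, and no other group uses it. bind_tf_idf() below applies the same calculation, with a natural logarithm, to the actual counts.
# made-up counts for illustration only
tf  <- 10 / 1000   # term frequency within the group
idf <- log(3 / 1)  # ln(number of groups / number of groups containing the word)
tf * idf           # tf-idf, roughly 0.011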
library(dplyr)
library(tidytext)
library(ggplot2)
#creating the tibble
headlines_tibble <- data %>%
select(Headline, np_ide)
as_tibble(headlines_tibble)
## # A tibble: 1,358 x 2
## Headline np_ide
## <chr> <chr>
## 1 MHP belediye imkanlarini kullaniyor anti
## 2 AKP zayifliyor anti
## 3 HDP secim aracina saldiri anti
## 4 Kaymakam mi yoksa AKP ilce baskani mi anti
## 5 Koyluye mektuplu degil askerli akrepli tehdit anti
## 6 Cami altina AKP burosu anti
## 7 Cami altina AKP burosu anti
## 8 Her fabrika bir secim burosu anti
## 9 Erdogan'in mitingine katilim ricasi anti
## 10 Davutoglu ozurlu diye konustu anti
## # ... with 1,348 more rows
headline_words <- headlines_tibble %>%
unnest_tokens(word, Headline) %>%
count(np_ide, word, sort = TRUE) %>%
ungroup()
total_words <- headline_words %>%
group_by(np_ide) %>%
summarize(total = sum(n))
headline_words<- left_join(headline_words, total_words)
headline_words
## # A tibble: 2,606 x 4
## np_ide word n total
## <chr> <chr> <int> <int>
## 1 anti secim 112 3236
## 2 anti oy 75 3236
## 3 anti saldiri 59 3236
## 4 anti yine 52 3236
## 5 anti akp 47 3236
## 6 anti her 42 3236
## 7 middle oy 40 1915
## 8 anti yerde 34 3236
## 9 middle ve 33 1915
## 10 middle secim 29 1915
## # ... with 2,596 more rows
headline_words <- headline_words %>%
bind_tf_idf(word, np_ide, n)
headline_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(np_ide) %>%
top_n(15) %>%
ungroup %>%
ggplot(aes(word, tf_idf, fill = np_ide)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~np_ide, ncol = 2, scales = "free") +
coord_flip()
Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when the researcher is not sure what s/he is looking for. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language. (Silge & Robinson, 2018)
First, two topics are extracted from the headlines of all newspapers.
# Vectorizing the headlines
v_headline <- data$Headline
# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)
#cleaning corpus
headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))
# creating the Document Term Matrix
library(topicmodels)
headline_dtm <- DocumentTermMatrix(headline_corpus)
library(tidytext)
headline_lda <- LDA(headline_dtm, k = 2, control = list(seed = 1234))
headline_topics <- tidy(headline_lda, matrix = "beta")
library(ggplot2)
library(dplyr)
headline_top_terms <- headline_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
headline_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
The first topic can be labelled as “Violence” and the second as “Integrity”.
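Besides the per-topic word probabilities (beta) plotted above, the per-document topic probabilities (gamma) can be extracted from the same fitted model; a minimal sketch using the headline_lda object created above, in which each headline receives a probability of belonging to each of the two topics:
# per-headline topic probabilities from the fitted two-topic model
headline_gamma <- tidy(headline_lda, matrix = "gamma")
head(headline_gamma, 10)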
# filtering the data for pro-government newspapers
data_pro <- data %>%
filter(np_ide == "pro")
# Vectorizing the headlines
v_headline <- data_pro$Headline
# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)
#cleaning corpus
headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))
# creating the Document Term Matrix
library(topicmodels)
headline_dtm <- DocumentTermMatrix(headline_corpus)
library(tidytext)
headline_lda <- LDA(headline_dtm, k = 2, control = list(seed = 1234))
headline_topics <- tidy(headline_lda, matrix = "beta")
library(ggplot2)
library(dplyr)
headline_top_terms <- headline_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
headline_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
# filtering the data for anti-government newspapers
data_anti <- data %>%
filter(np_ide == "anti")
# Vectorizing the headlines
v_headline <- data_anti$Headline
# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)
#cleaning corpus
headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))
# creating the Document Term Matrix
library(topicmodels)
headline_dtm <- DocumentTermMatrix(headline_corpus)
library(tidytext)
headline_lda <- LDA(headline_dtm, k = 2, control = list(seed = 1234))
headline_topics <- tidy(headline_lda, matrix = "beta")
library(ggplot2)
library(dplyr)
headline_top_terms <- headline_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
headline_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
The following analysis examines term correlations (word associations) with selected party acronyms in the headlines, grouped by newspaper ideology.
data_pro <- data %>%
filter(np_ide == "pro")
# Vectorizing the headlines
v_headline <- data_pro$Headline
# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)
#cleaning corpus
headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))
# creating the termdocument matrix
headline_tdm <- TermDocumentMatrix(headline_corpus)
library(qdap)
# correlations with the term 'hdp'
cor.hdp <- findAssocs(headline_tdm, "hdp", 0.2)
# converting the correlations to a data frame
cor.hdp.df <- list_vect2df(cor.hdp)[, 2:3]
# plotting the correlations
library(ggplot2)
library(ggthemes)
ggplot(cor.hdp.df, aes(y = cor.hdp.df [, 1])) +
geom_point(aes(x = cor.hdp.df [, 2]),
data = cor.hdp.df , size = 3) +
theme_gdocs()
library(qdap)
# correlations with the term 'chp'
cor.chp <- findAssocs(headline_tdm, "chp", 0.2)
# converting the correlations to a data frame
cor.chp.df <- list_vect2df(cor.chp)[, 2:3]
# plotting the correlations
library(ggplot2)
library(ggthemes)
ggplot(cor.chp.df, aes(y = cor.chp.df [, 1])) +
geom_point(aes(x = cor.chp.df [, 2]),
data = cor.chp.df , size = 3) +
theme_gdocs()
library(qdap)
data_anti <- data %>%
filter(np_ide == "anti")
# Vectorizing the headlines
v_headline <- data_anti$Headline
# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)
#cleaning corpus
headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))
# creating the termdocument matrix
headline_tdm <- TermDocumentMatrix(headline_corpus)
# correlations with the term 'akp'
cor.akp <- findAssocs(headline_tdm, "akp", 0.2)
# converting the correlations to a data frame
cor.akp.df <- list_vect2df(cor.akp)[, 2:3]
# plotting the correlations
library(ggplot2)
library(ggthemes)
ggplot(cor.akp.df, aes(y = cor.akp.df[, 1])) +
geom_point(aes(x = cor.akp.df [, 2]),
data = cor.akp.df , size = 3) +
theme_gdocs()
# creating the tibble for analysis
ngram_hl <- data %>%
select(Headline, np_ide)
as_tibble(ngram_hl)
## # A tibble: 1,358 x 2
## Headline np_ide
## <chr> <chr>
## 1 MHP belediye imkanlarini kullaniyor anti
## 2 AKP zayifliyor anti
## 3 HDP secim aracina saldiri anti
## 4 Kaymakam mi yoksa AKP ilce baskani mi anti
## 5 Koyluye mektuplu degil askerli akrepli tehdit anti
## 6 Cami altina AKP burosu anti
## 7 Cami altina AKP burosu anti
## 8 Her fabrika bir secim burosu anti
## 9 Erdogan'in mitingine katilim ricasi anti
## 10 Davutoglu ozurlu diye konustu anti
## # ... with 1,348 more rows
# Creating the bigrams from headlines
library(dplyr)
library(tidytext)
headline_bigrams <- ngram_hl %>%
unnest_tokens(bigram, Headline, token = "ngrams", n = 2)
# Sorting the bigrams
headline_bigrams %>%
count(bigram, sort = T)
## # A tibble: 3,835 x 2
## bigram n
## <chr> <int>
## 1 her yerde 34
## 2 oy pusulasi 18
## 3 oy ve 18
## 4 secim sonuclarina 17
## 5 ve otesi 17
## 6 yerde yolsuzluk 17
## 7 14 ilde 16
## 8 ilde itiraz 16
## 9 itiraz var 16
## 10 onbin oyda 16
## # ... with 3,825 more rows
# tf-idf analysis
headline_tf_idf <- headline_bigrams %>%
count(np_ide, bigram) %>%
bind_tf_idf(bigram, np_ide, n) %>%
arrange(desc(tf_idf))
# Graph
library(ggplot2)
headline_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(np_ide) %>%
top_n(15) %>%
ungroup %>%
ggplot(aes(bigram, tf_idf, fill = np_ide)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~np_ide, ncol = 2, scales = "free") +
coord_flip()
library(igraph)
# separating the bigrams into their component words
library(tidyr)
bigrams_separated <- headline_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
count(word1, word2, sort = TRUE)
bigram_graph <- bigrams_separated %>%
filter(n > 5) %>%
graph_from_data_frame()
#graph
library(ggraph)
a <- grid::arrow(type = "closed", length = unit(.10, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()