1 Introduction

This analysis was carried out within the framework of UMEVT (Understanding and Mitigating Electoral Violence in Turkey). UMEVT is funded by the British Academy's Newton Advanced Fellowship Scheme.

The construction of the Database of Incidents of Electoral Violence in Turkey (DIEV-T) was carried out by researchers based at Atilim University and King’s College London. The project leaders were Emre Toros (Atilim University) and Sarah Birch (King’s College London). All data within the UMEVT project will become publicly available after the completion of the project.

This is an early attempt to understand the text of electoral integrity and electoral violence news that appeared in Turkish newspapers across a number of elections, using several tools including Latent Dirichlet Allocation (LDA), a generative statistical procedure that allows sets of observations to be explained by unobserved groups which account for why some parts of the data are similar.

This piece is limited to the presentation of the analysis and does not include any comments or conclusions. For the technical details of the tools used, please refer to “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson (https://www.tidytextmining.com/).

2 Data

DIEV-T is designed to measure electoral violence in Turkey in parliamentary, local and presidential elections that have taken place since 1950. The DIEV-T includes content analysis of newspaper reports for each election, covering a period from at least 15 days before to 15 days after each election day. All documented incidents of electoral violence during each one-month period are identified. The following aspects of each incident of electoral violence are coded: time, place, type of incident, and characteristics of the perpetrator(s) and victim(s) of violence, including number, partisan affiliation, state affiliation and gender.

The content analysis was carried out on a sample of national news media coverage for each election that took place. The database relies on newspapers because these media are consistently listed as the most important sources of information for Turkish citizens. The DIEV-T was extracted from digital copies of the newspapers and produced following a set coding procedure.

In choosing newspapers, we aimed to include titles with broad coverage as well as ideological diversity. Hence, newspapers were selected to provide a comprehensive picture of news coverage in Turkey.

The data used in this particular analysis were collected from various newspapers during the elections of 2009, 2011, 2014, 2015 and 2018; when the database is complete it will extend back to 1950. Although the data include a number of other variables, the only variable used in this study is the heading and/or headline of each news item on electoral integrity and violence.

3 Analysis

The following analysis is inspired by “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson (https://www.tidytextmining.com/). As an overarching theme, the newspapers are grouped into pro-government, anti-government and middle-of-the-road titles.

DIEV.T <- read.csv("C:/Users/user/OneDrive/Makale/ElectoralIntegrity/DIEVT/DIEV-T.csv", sep=";")
data <- DIEV.T


# creating pro/anti gov newspaper variable
library(dplyr)
data <- data %>% 
  mutate(np_ide = case_when( 
    Newspaper == 'Akit' ~ "pro",
    Newspaper == 'Haberturk' ~ "pro",
    Newspaper == 'Sabah' ~ "pro",
    Newspaper == 'Aydınlık' ~ "anti",
    Newspaper == 'Cumhuriyet' ~ "anti",
    Newspaper == 'Evrensel' ~ "anti",
    Newspaper == 'Ortadogu' ~ "anti",
    Newspaper == 'Radikal' ~ "anti",
    Newspaper == 'Sözcü' ~ "anti",
    TRUE ~ "middle"))

The following function converts Turkish characters to plain ASCII equivalents. R frequently runs into encoding problems with such characters, and applying a conversion like this is a convenient way of avoiding them.

to.plain <- function(s) {
  
  # 1 character substitutions
  old1 <- "çğşıüöÇĞŞIÜÖ"
  new1 <- "cgsiuocgsiuo"
  s1 <- chartr(old1, new1, s)
  
  # 2 character substitutions
  old2 <- c("œ", "ß", "æ", "ø")
  new2 <- c("oe", "ss", "ae", "oe")
  s2 <- s1
  for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)
  
  s2
}

# usage: data$variable <- to.plain(data$variable)

data$Headline <- to.plain(data$Headline) 

4 Term Frequency Analysis

To understand a term’s importance, its inverse document frequency (idf) can be calculated; this decreases the weight of commonly used words and increases the weight of words that are not used very much in a collection of documents.

This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites (Silge & Robinson, 2018).
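
To make this arithmetic concrete before applying it to the headlines, the following sketch computes tf, idf and tf-idf by hand on a small invented count table; the groups, words and counts are made up for illustration and are not taken from DIEV-T (the real analysis below uses bind_tf_idf()).

# a minimal, self-contained sketch of the tf-idf arithmetic on invented counts
library(dplyr)

toy <- tibble::tribble(
  ~group, ~word,     ~n,
  "anti", "secim",   112,
  "anti", "saldiri",  59,
  "pro",  "secim",    80,
  "pro",  "miting",   21
)

toy %>%
  group_by(group) %>%
  mutate(tf = n / sum(n)) %>%                              # term frequency within each group
  group_by(word) %>%
  mutate(idf = log(length(unique(toy$group)) / n())) %>%   # idf = ln(number of groups / groups containing the word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)                                # tf-idf is the product of the two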

library(dplyr)
library(tidytext)
library(ggplot2)

#creating the tibble 
headlines_tibble <- data %>% 
  select(Headline, np_ide)

as_tibble(headlines_tibble)
## # A tibble: 1,358 x 2
##    Headline                                      np_ide
##    <chr>                                         <chr> 
##  1 MHP belediye imkanlarini kullaniyor           anti  
##  2 AKP zayifliyor                                anti  
##  3 HDP secim aracina saldiri                     anti  
##  4 Kaymakam mi yoksa AKP ilce baskani mi         anti  
##  5 Koyluye mektuplu degil askerli akrepli tehdit anti  
##  6 Cami altina AKP burosu                        anti  
##  7 Cami altina AKP burosu                        anti  
##  8 Her fabrika bir secim burosu                  anti  
##  9 Erdogan'in mitingine katilim ricasi           anti  
## 10 Davutoglu ozurlu diye konustu                 anti  
## # ... with 1,348 more rows
headline_words <- headlines_tibble %>%
  unnest_tokens(word, Headline) %>%
  count(np_ide, word, sort = TRUE) %>%
  ungroup()

total_words <- headline_words %>%
  group_by(np_ide) %>%
  summarize(total = sum(n))

headline_words <- left_join(headline_words, total_words)

headline_words
## # A tibble: 2,606 x 4
##    np_ide word        n total
##    <chr>  <chr>   <int> <int>
##  1 anti   secim     112  3236
##  2 anti   oy         75  3236
##  3 anti   saldiri    59  3236
##  4 anti   yine       52  3236
##  5 anti   akp        47  3236
##  6 anti   her        42  3236
##  7 middle oy         40  1915
##  8 anti   yerde      34  3236
##  9 middle ve         33  1915
## 10 middle secim      29  1915
## # ... with 2,596 more rows
headline_words <- headline_words %>% 
  bind_tf_idf(word, np_ide, n)

headline_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(np_ide) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = np_ide)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~np_ide, ncol = 2, scales = "free") +
  coord_flip()

5 Topic Modeling

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when the researcher is not sure what he or she is looking for. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language (Silge & Robinson, 2018).

5.1 All Newspapers

Extracting 2 topics from all newspapers.

# Vectorizing the headlines
v_headline <- data$Headline

# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)

#cleaning corpus

headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))

# creating the Document Term Matrix

library(topicmodels)
headline_dtm <- DocumentTermMatrix(headline_corpus)

library(tidytext)
headline_lda <- LDA(headline_dtm, k = 2, control = list(seed = 1234))

headline_topics <- tidy(headline_lda, matrix = "beta")

library(ggplot2)
library(dplyr)
headline_top_terms <- headline_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

headline_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

The first topic can be labelled as “Violence” and the second as “Integrity”.
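
The per-topic word probabilities (beta) used above are only half of what LDA estimates: it also produces per-document topic proportions (gamma), which is what makes each headline a mixture of the two topics rather than a member of exactly one. The following is a brief sketch of how those proportions could be inspected for the model fitted above; the tidy() call and column names follow tidytext conventions.

# per-document topic proportions (gamma); "document" is the headline's position in the corpus
headline_gamma <- tidy(headline_lda, matrix = "gamma")

# counting how many headlines lean mostly towards each of the two topics
headline_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  count(topic)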

5.2 Pro-government newspapers

# filtering the data for pro-government newspapers

data_pro <- data %>% 
  filter(np_ide == "pro")

# Vectorizing the headlines
v_headline <- data_pro$Headline

# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)

#cleaning corpus

headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))

# creating the Document Term Matrix

library(topicmodels)
headline_dtm <- DocumentTermMatrix(headline_corpus)

library(tidytext)
headline_lda <- LDA(headline_dtm, k = 2, control = list(seed = 1234))

headline_topics <- tidy(headline_lda, matrix = "beta")

library(ggplot2)
library(dplyr)
headline_top_terms <- headline_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
headline_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

5.3 Anti-government newspapers

# filtering the data for anti-government newspapers

data_anti <- data %>% 
  filter(np_ide == "anti")

# Vectorizing the headlines
v_headline <- data_anti$Headline

# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)

#cleaning corpus

headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))

# creating the Document Term Matrix

library(topicmodels)
headline_dtm <- DocumentTermMatrix(headline_corpus)

library(tidytext)
headline_lda <- LDA(headline_dtm, k = 2, control = list(seed = 1234))

headline_topics <- tidy(headline_lda, matrix = "beta")

library(ggplot2)
library(dplyr)
headline_top_terms <- headline_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
headline_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

6 Correlations

The following section provides some word correlation analyses, grouped by newspaper ideology.
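
The statistic that findAssocs() reports below is the Pearson correlation between a term’s occurrence counts across documents (here, headlines) and the counts of every other term, filtered at a given threshold. The following self-contained sketch illustrates the idea on a few invented headlines; the texts and terms are made up for illustration only.

# a toy illustration of what findAssocs() computes; the texts are invented
library(tm)
toy_docs <- c("hdp miting saldiri", "hdp miting", "sandik itiraz tutanak", "hdp saldiri")
toy_tdm  <- TermDocumentMatrix(VCorpus(VectorSource(toy_docs)))

findAssocs(toy_tdm, "hdp", 0)   # correlations of "hdp" with every other term

# the same quantity for a single pair, computed by hand
m <- as.matrix(toy_tdm)
cor(m["hdp", ], m["miting", ])

In the real analysis a threshold of 0.2 is used, so that only the stronger associations are kept.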

# filtering the data for pro-government newspapers
data_pro <- data %>% 
  filter(np_ide == "pro")

# Vectorizing the headlines
v_headline <- data_pro$Headline

# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)

#cleaning corpus

headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))


# creating the termdocument matrix
headline_tdm <- TermDocumentMatrix(headline_corpus)

6.1 Correlations for the word HDP in pro-government newspapers

library(qdap)
# hdp correlations
cor.hdp <- findAssocs(headline_tdm, "hdp", 0.2)

# converting the correlations to a data frame
cor.hdp.df <- list_vect2df(cor.hdp)[, 2:3]

# plotting the correlations
library(ggplot2)
library(ggthemes)
ggplot(cor.hdp.df, aes(y = cor.hdp.df[, 1])) +
  geom_point(aes(x = cor.hdp.df[, 2]),
             data = cor.hdp.df, size = 3) +
  theme_gdocs()

6.2 Correlations for the word CHP in pro-government newspapers

library(qdap)

# chp correlations
cor.chp <- findAssocs(headline_tdm, "chp", 0.2)

# converting the correlations to a data frame
cor.chp.df <- list_vect2df(cor.chp)[, 2:3]

# plotting the correlations
library(ggplot2)
library(ggthemes)
ggplot(cor.chp.df, aes(y = cor.chp.df[, 1])) +
  geom_point(aes(x = cor.chp.df[, 2]),
             data = cor.chp.df, size = 3) +
  theme_gdocs()

6.3 Correlations for the word AKP in anti-government newspapers

library(qdap)


# filtering the data for anti-government newspapers
data_anti <- data %>% 
  filter(np_ide == "anti")

# Vectorizing the headlines
v_headline <- data_anti$Headline

# creating corpus
library(tm)
source <- VectorSource(v_headline)
headline_corpus <- VCorpus(source)

#cleaning corpus

headline_corpus <- tm_map(headline_corpus, removePunctuation)
headline_corpus <- tm_map(headline_corpus, content_transformer(tolower))
headline_corpus <- tm_map(headline_corpus, removeWords, c(stopwords("en")))


# creating the termdocument matrix
headline_tdm <- TermDocumentMatrix(headline_corpus)


# akp correlations
cor.akp <- findAssocs(headline_tdm, "akp", 0.2)

# converting the correlations to a data frame
cor.akp.df <- list_vect2df(cor.akp)[, 2:3]

# plotting the correlations
library(ggplot2)
library(ggthemes)
ggplot(cor.akp.df, aes(y = cor.akp.df[, 1])) +
  geom_point(aes(x = cor.akp.df[, 2]),
             data = cor.akp.df, size = 3) +
  theme_gdocs()

7 N-Grams

# creating the tibble for analysis

ngram_hl <- data %>% 
  select(Headline, np_ide)

as_tibble(ngram_hl)
## # A tibble: 1,358 x 2
##    Headline                                      np_ide
##    <chr>                                         <chr> 
##  1 MHP belediye imkanlarini kullaniyor           anti  
##  2 AKP zayifliyor                                anti  
##  3 HDP secim aracina saldiri                     anti  
##  4 Kaymakam mi yoksa AKP ilce baskani mi         anti  
##  5 Koyluye mektuplu degil askerli akrepli tehdit anti  
##  6 Cami altina AKP burosu                        anti  
##  7 Cami altina AKP burosu                        anti  
##  8 Her fabrika bir secim burosu                  anti  
##  9 Erdogan'in mitingine katilim ricasi           anti  
## 10 Davutoglu ozurlu diye konustu                 anti  
## # ... with 1,348 more rows
# Creating the bigrams from headlines
library(dplyr)
library(tidytext)
headline_bigrams <- ngram_hl %>% 
  unnest_tokens(bigram, Headline, token = "ngrams", n = 2)

# Sorting the bigrams

headline_bigrams %>% 
  count(bigram, sort = TRUE)
## # A tibble: 3,835 x 2
##    bigram                n
##    <chr>             <int>
##  1 her yerde            34
##  2 oy pusulasi          18
##  3 oy ve                18
##  4 secim sonuclarina    17
##  5 ve otesi             17
##  6 yerde yolsuzluk      17
##  7 14 ilde              16
##  8 ilde itiraz          16
##  9 itiraz var           16
## 10 onbin oyda           16
## # ... with 3,825 more rows
#tf_idf analysis

headline_tf_idf <- headline_bigrams %>% 
  count(np_ide, bigram) %>% 
  bind_tf_idf(bigram, np_ide, n) %>%
  arrange(desc(tf_idf))

# Graph
library(ggplot2)
headline_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
  group_by(np_ide) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = np_ide)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~np_ide, ncol = 2, scales = "free") +
  coord_flip()

8 Network Analysis

library(igraph)

# separating the bigrams into their component words

library(tidyr)
bigrams_separated <- headline_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

bigram_graph <- bigrams_separated %>%
  filter(n > 5) %>%
  graph_from_data_frame()


#graph
library(ggraph)
a <- grid::arrow(type = "closed", length = unit(.10, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()