Eksi Sozluk Data for Political Popular Culture

INTRODUCTION

Data Collection Process

For this project, we rely on Ekşi Sözlük as a primary data source because it offers a uniquely rich, user-generated record of public discourse in Turkey. Ekşi Sözlük functions somewhat like a hybrid between a forum and a wiki: users open “topics” and others contribute comments (“entries”) under those topics over time. Unlike traditional social media, discussions are organized chronologically within a fixed topic, which makes the platform especially useful for tracing how reactions evolve around specific events.

One important feature for our research is that authors can revise and update their entries retrospectively. This means posts are not strictly static snapshots; they may reflect evolving interpretations or corrections. While this introduces some complexity, it also provides insight into how narratives and opinions shift over time.

Relevant Topics

We focus on five key topics related to the television series Kızılcık Şerbeti, each capturing a different aspect of audience reaction:

  1. https://eksisozluk.com/kizilcik-serbeti-dizi--7435046
     The main discussion thread, with over 1,600 pages (~16,000 comments). Starting around page 57, users discuss the Nursema episode, which becomes a focal point of controversy. From roughly page 70 onward, discussion shifts toward the RTÜK ban, showing how attention moves from narrative content to regulatory issues.

  2. https://eksisozluk.com/kizilcik-serbeti-nursema--7606493
     A smaller, more focused thread (~20 comments) dedicated specifically to the Nursema storyline.

  3. https://eksisozluk.com/14-nisan-2023-rtuk-kizilcik-serbeti-savasi--7637498
     A single-page thread capturing immediate reactions to the RTÜK-related developments on April 14, 2023.

  4. https://eksisozluk.com/kizilcik-serbeti-senaristinin-gozaltina-alinmasi--8026864
     Around 2 pages (~20 comments) discussing the detention of the show’s screenwriter.

  5. https://eksisozluk.com/kizilcik-serbetine-5-hafta-yayinlanmama-cezasi--7632299
     Another short thread (~20 comments) focusing on the broadcasting ban imposed on the series.

Together, these threads allow us to capture both large-scale discourse (from the main thread) and event-specific reactions (from smaller, focused threads).

Data Collection Approach

There is currently no established R or Python package designed specifically for scraping Ekşi Sözlük. As a result, I developed a custom scraping script tailored to the platform’s structure.

Here’s what I did in detail:

  • Targeted the main topic first (the largest dataset), since it contains the bulk of the discussion (~16,000 comments).
  • Iterated through all available pages (over 1,600), which required handling pagination systematically.
  • For each entry, I extracted:
    • Text content (the main body of the comment)
    • Date
    • Time

These were selected because they are the most analytically useful fields:

  • The text allows for content analysis (sentiment, themes, discourse patterns).
  • The timestamp (date + time) enables temporal analysis, such as tracking spikes in discussion around key events (e.g., the Nursema episode or RTÜK decisions).
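The pagination step can be sketched roughly as follows. This is a minimal base R illustration, not the actual scraping script: the "?p=" page parameter is an assumption about the site's URL scheme, and fetch_page is a hypothetical placeholder for the function that downloads and parses one page, replaced here by a stub so the sketch runs offline.

```r
# Build the URL for each page of a topic.
# ASSUMPTION: pages are addressed with a "?p=<n>" query parameter.
build_page_urls <- function(topic_url, n_pages) {
  paste0(topic_url, "?p=", seq_len(n_pages))
}

# Walk through the pages and stack whatever the parser returns.
# `fetch_page` is a hypothetical function (url -> data.frame of entries);
# the real version would download and parse the page HTML.
scrape_topic <- function(topic_url, n_pages, fetch_page, pause = 0) {
  urls <- build_page_urls(topic_url, n_pages)
  pages <- lapply(seq_along(urls), function(i) {
    Sys.sleep(pause)                # pause between requests to be polite
    out <- fetch_page(urls[i])
    out$page <- rep(i, nrow(out))   # record the page number for each entry
    out
  })
  do.call(rbind, pages)
}

# Stub parser for illustration only: two fake entries per page.
fake_fetch <- function(url) {
  data.frame(text = paste("entry from", url, 1:2), stringsAsFactors = FALSE)
}

demo <- scrape_topic("https://eksisozluk.com/kizilcik-serbeti-dizi--7435046",
                     n_pages = 3, fetch_page = fake_fetch)
```

Passing the parser in as an argument keeps the loop independent of the page markup, so the pagination logic can be checked without any network access.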

Overall, Ekşi Sözlük provides a longitudinal, topic-centered, and user-driven dataset, making it particularly well-suited for studying how public discourse unfolds in response to media events.

Exploratory Data Analysis

The dataset has eight columns. Six of them are described here; the other two (id and text_clean) appear in the output below.

  1. text: This variable contains the full textual content of the entry written by the user. It is the primary variable for analysis, as it captures opinions, reactions, and narratives related to the topic.
  2. page: This indicates the page number on which the entry appears within the topic. Since Ekşi Sözlük organizes entries in a paginated format (typically 10 entries per page), this variable provides a rough sense of position and sequence within the overall discussion.
  3. initial_date: This is the original date when the entry was first posted by the author. It reflects when the comment initially entered the discussion and is crucial for constructing a timeline of reactions.
  4. initial_time: This records the exact time of day (e.g., hour and minute) when the entry was first posted. Combined with initial_date, it allows for fine-grained temporal analysis, such as identifying bursts of activity within a single day.
  5. last_date: This variable captures the date of the most recent edit made to the entry. Because Ekşi Sözlük allows users to revise their posts, this field helps track whether and when content has been updated after its initial publication.
  6. last_time: This is the time of the most recent edit. Together with last_date, it provides a complete timestamp for the latest modification, enabling us to distinguish between original and revised content.

Overall, the distinction between initial and last timestamps is particularly important in this context, as it reflects the platform’s editable nature and allows us to account for temporal dynamics not just in posting behavior, but also in post-publication revisions.
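As an illustration of how the four timestamp fields could be derived, here is a small base R sketch. It assumes the raw timestamp arrives as a string like "14.04.2023 21:05", optionally followed by "~" and either a full edit timestamp or, for same-day edits, only a time. That format is an assumption for this example, and parse_entry_stamp is a hypothetical helper, not part of the actual script.

```r
# Split one raw timestamp string into initial and last-edit fields.
# ASSUMED input formats:
#   "14.04.2023 21:05"                       (never edited)
#   "14.04.2023 21:05 ~ 23:10"               (edited the same day)
#   "14.04.2023 21:05 ~ 15.04.2023 09:30"    (edited on a later day)
parse_entry_stamp <- function(stamp) {
  parts <- trimws(strsplit(stamp, "~", fixed = TRUE)[[1]])
  first <- strsplit(parts[1], " ", fixed = TRUE)[[1]]
  initial_date <- as.Date(first[1], format = "%d.%m.%Y")
  initial_time <- first[2]
  last_date <- as.Date(NA)          # stays missing when there is no edit
  last_time <- NA_character_
  if (length(parts) == 2) {
    edit <- strsplit(parts[2], " ", fixed = TRUE)[[1]]
    if (length(edit) == 2) {        # date and time: edited on a later day
      last_date <- as.Date(edit[1], format = "%d.%m.%Y")
      last_time <- edit[2]
    } else {                        # time only: edited on the same day
      last_date <- initial_date
      last_time <- edit[1]
    }
  }
  data.frame(initial_date, initial_time, last_date, last_time,
             stringsAsFactors = FALSE)
}

edited <- parse_entry_stamp("14.04.2023 21:05 ~ 15.04.2023 09:30")
unedited <- parse_entry_stamp("14.04.2023 21:05")
```

Entries that were never edited keep missing values in last_date and last_time, which is consistent with the large n_missing counts for those columns in the summary.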

# Variable names
names(ks_df)
[1] "text"         "page"         "id"           "initial_date" "initial_time"
[6] "last_date"    "last_time"    "text_clean"  
# Quick look at the variables (skim() comes from the skimr package)
skim(ks_df)
Data summary
Name ks_df
Number of rows 15898
Number of columns 8
_______________________
Column type frequency:
character 4
Date 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min    max  empty  n_unique  whitespace
text                   0            1.0   11  14307      0     15892           0
initial_time           0            1.0    5      5      0      1392           0
last_time          12687            0.2    5      5      0      1049           0
text_clean             0            1.0   11  14295      0     15892           0

Variable type: Date

skim_variable  n_missing  complete_rate  min         max         median      n_unique
initial_date           0           1.00  2022-10-10  2026-03-06  2024-05-31       894
last_date          15305           0.04  2022-10-11  2026-03-03  2024-03-23       335

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd  p0      p25     p50       p75   p100  hist
page                   0              1   808.58   465.81   1   406.00   808.0   1212.00   1614  ▇▇▇▇▇
id                     0              1  8081.32  4658.11   2  4052.25  8079.5  12118.75  16140  ▇▇▇▇▇

Key Dates in Our Data

To contextualize the temporal patterns in the Ekşi Sözlük data, we define three critical dates corresponding to major real-world events that likely influenced online discussions:

ban_date (2023-03-22): This marks the date when Kızılcık Şerbeti faced a temporary broadcasting ban imposed by RTÜK. This event triggered significant public debate, both about the show’s content and broader issues of media regulation. We expect to see a noticeable spike in activity and shifts in sentiment around this date.

election_date (2023-05-28): This corresponds to the second round of the 2023 Turkish presidential election. Although not directly related to the show, national elections often shape public discourse, potentially influencing how audiences interpret themes in the series (e.g., politics, social values). This serves as an important external benchmark.

arrest_date (2025-12-15): This represents the date when the screenwriter of the series was reportedly taken into custody, an event discussed in one of the Ekşi Sözlük threads. While this occurs later than the main broadcast controversies, it provides a useful point for examining how discussions resurface or evolve in response to new developments.

These dates allow us to anchor the dataset in real-world events and conduct event-based temporal analysis, such as:

  • Comparing pre- and post-event discussion intensity
  • Identifying shifts in narrative or sentiment
  • Detecting delayed reactions or renewed attention

In short, they act as reference points for interpreting fluctuations in user-generated content over time.

# Dates to consider
ban_date <- as.Date("2023-03-22")
election_date <- as.Date("2023-05-28")
arrest_date <- as.Date("2025-12-15")

Activity Over Time

Let’s look at the daily number of posts. I also added columns giving the number of days relative to the ban and election dates. The table lists the 20 days with the most posts. Most of them do not fall immediately after the ban or the election, but rather around the airing of the season finale episodes, which is likely the main driver of engagement.

# A tibble: 20 × 6
   initial_date n_posts ban_relative days_label election_relative
   <date>         <int>        <int> <chr>                  <int>
 1 2024-06-07       286          443 +443 days                376
 2 2024-06-08       280          444 +444 days                377
 3 2023-12-15       203          268 +268 days                201
 4 2024-03-09       179          353 +353 days                286
 5 2025-01-10       170          660 +660 days                593
 6 2024-09-13       148          541 +541 days                474
 7 2024-03-15       146          359 +359 days                292
 8 2024-12-27       135          646 +646 days                579
 9 2024-03-08       130          352 +352 days                285
10 2023-12-01       124          254 +254 days                187
11 2023-04-14       121           23 +23 days                 -44
12 2023-06-10       115           80 +80 days                  13
13 2025-03-14       115          723 +723 days                656
14 2025-04-12       115          752 +752 days                685
15 2024-02-23       112          338 +338 days                271
16 2024-03-29       111          373 +373 days                306
17 2025-09-13       111          906 +906 days                839
18 2023-12-16       110          269 +269 days                202
19 2025-02-28       110          709 +709 days                642
20 2023-06-09       108           79 +79 days                  12
# ℹ 1 more variable: days_label_elec <chr>
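A table of this shape can be built along the following lines. This is a base R sketch on four made-up dates rather than the code that produced the output above; toy_dates stands in for ks_df$initial_date.

```r
ban_date <- as.Date("2023-03-22")
election_date <- as.Date("2023-05-28")

# Toy stand-in for ks_df$initial_date (illustrative values only)
toy_dates <- as.Date(c("2023-04-14", "2023-04-14", "2023-06-10", "2023-03-20"))

# Count posts per day
daily <- as.data.frame(table(initial_date = toy_dates), stringsAsFactors = FALSE)
names(daily)[2] <- "n_posts"
daily$initial_date <- as.Date(daily$initial_date)

# Days elapsed since each event (negative = before the event)
daily$ban_relative <- as.integer(daily$initial_date - ban_date)
daily$election_relative <- as.integer(daily$initial_date - election_date)
daily$days_label <- sprintf("%+d days", daily$ban_relative)

# Busiest days first
daily <- daily[order(-daily$n_posts), ]
```

The signed difference makes it easy to see at a glance whether a busy day falls before or after an event: 2023-04-14 is 23 days after the ban and 44 days before the election.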

Below, you can find all Ekşi Sözlük posts in 2023, covering the TV show ban and the national election in the following months. Engagement visibly rises around the ban. We should also consider the episodes that aired near the end of the year, as these also draw high engagement.

If we include all years in this data, the TV ban and the election are still among the most significant spikes, but there is also a noticeable increase in posts around the end of 2023, which likely corresponds to the airing of new episodes. The detention of the screenwriter also coincides with a smaller spike, indicating renewed interest or controversy around that time.

Instead of daily posts, we can also look at the monthly share of posts. Again, the highest engagement does not coincide with the ban of the show.

Nursema Effect

I also looked specifically at mentions of Nursema in the posts. Ahead of the RTÜK ban, “nursema” mentions increase and peak around the ban date, which makes sense since the ban was largely driven by the controversy around that storyline. After the ban, however, mentions of Nursema drop sharply and never really recover to pre-ban levels, even during the election period. This suggests that while Nursema was a key driver of discussion leading up to the ban, it did not maintain its prominence in the discourse afterward, possibly because the conversation shifted toward the regulatory and political implications rather than the content of the show itself.

KWIC Approach

I want to start with the keyword-in-context (KWIC) approach. This first tier establishes the empirical baseline: how often do politically salient terms appear in the corpus, and does that frequency change around key external events? KWIC concordances display each hit surrounded by its immediate context window.

This is a deliberate first step — reading concordances before counting anything ensures the keyword lists are capturing the intended phenomenon rather than spurious matches. A word like “özgürlük” (freedom) can appear in both secular and religious framings; the concordance tells you which is dominant in this corpus before you commit to including it in a category.

Once the keyword lists are validated, we count hits per entry and aggregate to weekly rates — hits per 100 entries rather than raw counts, so that weeks with more entries do not artificially inflate frequency.
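The normalisation can be illustrated with a few made-up entries. This is a base R sketch, not the pipeline code; the toy data frame and its hits column are illustrative, and weeks are taken to start on Monday (the default for cut() on dates).

```r
# Toy entries: posting date and number of keyword hits in each entry
toy <- data.frame(
  initial_date = as.Date(c("2023-03-20", "2023-03-21", "2023-03-22",
                           "2023-03-28", "2023-03-29")),
  hits = c(2, 0, 1, 0, 1)
)

# Assign each entry to the Monday that starts its week
toy$week <- as.Date(cut(toy$initial_date, breaks = "week"))

# Total hits and number of entries per week
hits_w <- tapply(toy$hits, toy$week, sum)
n_w <- tapply(toy$hits, toy$week, length)

weekly <- data.frame(
  week = as.Date(names(hits_w)),
  hits = as.vector(hits_w),
  n_entries = as.vector(n_w)
)

# Hits per 100 entries, so busy weeks do not inflate the measure
weekly$rate_per_100 <- 100 * weekly$hits / weekly$n_entries
```

In this toy example the first week has three hits across three entries and the second has one hit across two entries, so the rates differ even though raw counts alone would exaggerate the gap.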

The resulting time series data, plotted with the ban and election dates as reference lines, gives the reader an immediate visual sense of whether political discourse around the show intensified at politically salient moments. This is descriptive, not inferential — the causal testing comes in tier 4 — but it motivates every subsequent analytical step and provides the most accessible evidence for a general audience.

This is invaluable for validating that our keywords actually capture political engagement rather than casual use. For instance, “kadın” (woman) might appear in a purely gossip context rather than a rights context, and KWIC reveals that quickly.

KWIC Analysis for Keywords

I decided to look at keywords on religion, secularism, gender, and politics, plus keywords specific to Nursema. Most matching entries are about religion, gender, and politics, but Nursema is also an important figure here. If we searched for “doga” or “fatih”, I would expect similar results: these are main characters, and it is hard to say whether they are mentioned in a political context or simply as an assessment of the characters.

# Packages: kwic() and phrase() come from quanteda, map() from purrr
library(quanteda)
library(purrr)

# Political keyword groups ----
# Defined thematically; each list can be expanded after inspecting KWIC output
pol_keywords <- list(
  religion = c("başörtü*", "türban*", "dindar*", "muhafazakar*", "namaz*",
               "imam*", "tarikat*", "müslüman*", "helal*", "haram*"),
  secularism = c("laik*", "atatürk*", "cumhuriyet*", "seküler*", "kemalist*"),
  gender = c("kadın*", "erkek*", "feminist*", "şiddet*", "taciz*",
             "namus*", "eşitlik*"),
  # Caution: "oy*" (vote) also matches oyun/oyuncu (game/actor),
  # so check its concordances before counting
  politics = c("seçim*", "iktidar*", "muhalefet*", "akp*", "chp*",
               "erdoğan*", "propaganda*", "siyasi*", "oy*"),
  nursema = c("nursema*")   # keep separate: proper noun, specific signal
)

# KWIC for qualitative inspection ----
# Do this first: read the concordances before you count anything
# (toks is assumed to be the quanteda tokens object built from the entries)
kwic_results <- list(
  religion = kwic(toks, pattern = phrase(pol_keywords$religion), window = 6),
  secularism = kwic(toks, pattern = phrase(pol_keywords$secularism), window = 6),
  gender = kwic(toks, pattern = phrase(pol_keywords$gender), window = 6),
  politics = kwic(toks, pattern = phrase(pol_keywords$politics), window = 6),
  nursema = kwic(toks, pattern = pol_keywords$nursema, window = 8)
)

# Quick count per group
map(kwic_results, nrow)
$religion
[1] 1141

$secularism
[1] 512

$gender
[1] 1935

$politics
[1] 973

$nursema
[1] 1079

Plotting KWIC Themes

Let’s plot these keywords. Politics appears to be the dominant theme before the ban, which supports our hypothesis. The pattern is concentrated in the pre-ban period, so we should examine that period in more detail.

Keyness Analysis

With the KWIC approach we can talk about how often political keywords appear. In keyness analysis, we can ask a harder question: what do commenters actually do with those words?

Two complementary methods address this.

Collocations identify which words habitually co-occur within a fixed window around a target keyword. If “kadın” (woman) most frequently appears alongside “kapalı” (covered), “namus” (honour), and “dindar” (devout) rather than alongside “eşitlik” (equality) or “özgür” (free), that pattern constitutes empirical evidence of a specific framing — the show’s female characters are being discussed primarily through a religious-conservative lens. Collocations make that framing legible as a quantitative pattern rather than a qualitative assertion.
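The window idea can be sketched in base R as follows. This toy version only counts raw co-occurrences within the window; it is a simpler stand-in for a proper association measure (such as the one a collocation routine in a text-analysis package would compute), and window_collocates is a hypothetical helper name.

```r
# Count the words that appear within `window` tokens of a target word.
window_collocates <- function(tokens, target, window = 5) {
  hits <- which(tokens == target)
  if (length(hits) == 0) return(table(character(0)))
  neighbours <- unlist(lapply(hits, function(i) {
    lo <- max(1, i - window)
    hi <- min(length(tokens), i + window)
    tokens[setdiff(lo:hi, i)]      # the window, excluding the target itself
  }))
  sort(table(neighbours), decreasing = TRUE)
}

# Toy token sequence (illustrative only)
toy_tokens <- c("nursema", "dindar", "bir", "karakter",
                "ve", "nursema", "dindar", "aile")

colloc <- window_collocates(toy_tokens, "nursema", window = 1)
```

Raw counts favour frequent words in general, which is why an association statistic is preferable once the corpus is large.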

Keyness analysis compares two sub-corpora — for instance, entries written before and after the May 2023 election — and identifies which words are statistically over- or under-represented in one period relative to the other. We use the log-likelihood ratio (G²) rather than chi-squared because it is more robust when the two corpora differ substantially in size, which is the case here given that the post-election period contains more entries.

Each G² score is signed by direction: a word with a high positive score appeared disproportionately more often after the event, while a large negative score means it receded. Taken together, collocations and keyness shift the argument from “political keywords are frequent” to “here is the specific vocabulary through which political meaning is constructed in this discourse, and it changed in a systematic direction around these dates.”
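To make the statistic concrete, here is a hand-rolled version for a single word. It uses the common two-term form of the log-likelihood ratio (observed versus expected frequency in each sub-corpus) and attaches a sign by direction. This is a sketch under that assumption; library implementations may use a slightly different formulation or apply corrections, and signed_g2 is a hypothetical helper name.

```r
# Signed log-likelihood (G2) for one word.
#   a, b             : word frequency in the target / reference sub-corpus
#   n_target, n_ref  : total tokens in each sub-corpus
signed_g2 <- function(a, b, n_target, n_ref) {
  # Expected frequencies if the word were equally common in both
  e1 <- n_target * (a + b) / (n_target + n_ref)
  e2 <- n_ref * (a + b) / (n_target + n_ref)
  # A zero count contributes nothing to the sum
  term <- function(obs, expected) ifelse(obs > 0, obs * log(obs / expected), 0)
  g2 <- 2 * (term(a, e1) + term(b, e2))
  # Positive when the word is relatively more frequent in the target
  ifelse(a / n_target >= b / n_ref, g2, -g2)
}

# Toy example: 30 uses in 10,000 post-event tokens vs 10 in 10,000 pre-event tokens
g_post <- signed_g2(30, 10, 10000, 10000)
g_pre <- signed_g2(10, 30, 10000, 10000)
```

With the post-event period as the target, g_post is positive and g_pre is its mirror image, matching the sign convention described above.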

So, the logic here is: collocations tell you what travels with your political keywords (framing), and keyness tells you what distinguishes discourse before vs. after each event date (shift). Together they let you say something like “after the election, religious framing intensified while secular vocabulary declined” — with numbers behind it.

Keyness – Pre vs Post Comparison

It looks like political keywords were more frequent prior to the election (shown in red). Again, this supports Lisel’s theory.

Collocation Plots

Which words are associated with nursema, religion, and secularism themes?

Wordfish Approach

Wordfish is a statistical method that reads through a large collection of texts and automatically arranges them along a single axis based purely on the vocabulary patterns it finds — without being told in advance what to look for. The core intuition is simple: people who are writing from a similar perspective tend to use similar words, and people writing from opposing perspectives tend to use systematically different words. Wordfish exploits this by finding the vocabulary dimension that best separates the texts from one another.

In our case, it reads every Ekşi Sözlük entry and assigns each one a position score — entries that cluster at one end of the axis tend to share a certain vocabulary, and entries at the other end share a different one. Crucially, the method does not know what those vocabularies mean politically — that interpretation is our job. Once the model has run, we inspect which words are pulling entries toward each pole: if religious and conservative terminology loads at one end while secular and oppositional vocabulary loads at the other, we can label the axis accordingly and say the model has recovered a meaningful ideological dimension from the data.
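For reference, the standard Wordfish specification (as introduced by Slapin and Proksch) models the count of word j in entry i as a Poisson variable. The notation below follows the usual presentation and is not taken from our code:

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \, \theta_i
```

Here \alpha_i is an entry fixed effect (absorbing entry length), \psi_j is a word fixed effect (absorbing overall word frequency), \theta_i is the entry's position on the latent axis, and \beta_j is the word's weight, that is, how strongly word j pulls entries toward one pole or the other.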

To measure the intensity of different political framings across entries, we constructed a custom four-category dictionary based on vocabulary patterns identified in the collocation and keyness analyses.

  • The religious category captures terms associated with conservative and religious identity, including words relating to the headscarf, piety, religious orders, and Islamic practice.
  • The secular category captures the opposing discursive tradition, centering on references to Kemalism, the republic, and laicism.
  • The gender threat category identifies entries foregrounding violence, harassment, and patriarchal norms — relevant because the show’s central dramatic tension revolves around gender relations between its secular and religious characters.
  • Finally, the political explicit category captures direct references to electoral politics, party names, and political actors, which allows us to distinguish entries that engage with the show’s political subtext implicitly through cultural framing from those that name the political stakes outright.

Each category uses wildcard matching so that inflected forms of a root are captured together — for instance, laik* matches laik, laiklik, and laikçi without requiring each form to be listed separately. This is particularly important for Turkish given its agglutinative morphology, where a single root can surface in dozens of grammatical forms. For each entry, we compute the proportion of tokens matching each category and scale the result to hits per 1,000 tokens, which normalises for entry length and makes scores comparable across entries of different sizes.
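The scoring step can be sketched in base R as follows. The two-category dictionary is a toy subset for illustration, score_entry is a hypothetical helper, and the tokenisation is deliberately crude (lowercase, split on non-letters); the actual pipeline may tokenise differently.

```r
# Toy dictionary with glob-style wildcards (illustrative subset only)
toy_dict <- list(
  secular = c("laik*", "cumhuriyet*"),
  religious = c("dindar*", "tarikat*")
)

# Score one entry: matches per 1,000 tokens for each category
score_entry <- function(text, dict) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  sapply(dict, function(patterns) {
    # glob2rx() turns "laik*" into the regex "^laik" (prefix match)
    rx <- paste(glob2rx(patterns), collapse = "|")
    1000 * sum(grepl(rx, tokens)) / length(tokens)
  })
}

scores <- score_entry("laik ve dindar aileler: laiklik, tarikat ve cumhuriyet",
                      toy_dict)
```

The toy entry has eight tokens, three of which match the secular patterns and two the religious ones. Note that base R's tolower() does not apply Turkish casing rules to dotted and dotless i, so real entries would need locale-aware lowercasing.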

Plotting the religious, secular, gender threat, and political explicit categories

Plotting the average weekly score for each category gives a clear picture of how the different political framings fluctuate over time, especially around key events like the ban and the election. The religious framing (in red) spikes prior to the ban, while the political framing (in purple) increases more gradually after the ban, through the run-up to the election and into the post-election period.

Again, this is sensitive to the keywords we selected, so the word lists deserve more careful selection.

Inferential Testing: Interrupted Time Series Approach

In this section, we ask whether the patterns observed in the previous sections represent genuine structural shifts or could be explained by random variation. We use two complementary approaches: interrupted time series (ITS) regression, which tests whether each political event produced a statistically significant change in discourse, and correlation/regression models that ask whether dictionary scores and Wordfish positions are systematically related to observable features of the data (time, period, entry length). Together these move the argument from “it looks like discourse changed around the election” to “discourse changed significantly, by this magnitude, at this point in time.”

We use the ITS approach here. ITS is the standard quasi-experimental design for observational time series data when there is a known intervention date but no control group. The model estimates four things: (1) the pre-event trend (was discourse already changing before the event?), (2) the level change (did the mean jump immediately at the event?), (3) the slope change (did the rate of change shift after the event?), and (4) residual variation unexplained by the model. We fit this for both event dates simultaneously, using the religious framing score as the primary outcome because the dictionary scores in the previous section suggested it was the most responsive to political shocks. We then repeat the model for theta (the Wordfish position) and political_explicit as robustness checks. The regression output below reports a period specification: each outcome is regressed on period indicators (pre-ban is the reference category) and log_tokens, so the period coefficients are level differences relative to the pre-ban period.
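The segmented-regression form of ITS can be sketched on simulated data as follows. This is a minimal single-event illustration with made-up numbers, not the model behind the output below; handling both event dates would add a second pair of level and slope terms.

```r
set.seed(1)

n_weeks <- 40
event_week <- 21

its <- data.frame(week = 1:n_weeks)
its$post <- as.integer(its$week >= event_week)          # 1 from the event onward
its$weeks_since <- pmax(0, its$week - event_week)       # time elapsed since the event

# Simulated outcome: flat pre-trend, a level drop of 50 at the event,
# no change in slope, plus noise
its$score <- 200 - 50 * its$post + rnorm(n_weeks, sd = 2)

# week        -> pre-event trend
# post        -> immediate level change at the event
# weeks_since -> change in slope after the event
fit <- lm(score ~ week + post + weeks_since, data = its)
coef(fit)
```

In this simulation the post coefficient should recover a level change close to the simulated drop of 50, while the week and weeks_since coefficients should be near zero.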

# A tibble: 3 × 6
  name                mean median    sd   max zeros
  <chr>              <dbl>  <dbl> <dbl> <dbl> <dbl>
1 political_explicit 140.       0  322.  1000 0.813
2 religious          161.       0  329.  1000 0.774
3 secular             63.3      0  202.  1000 0.881

# A tibble: 1 × 12
  religious_mean religious_median religious_sd secular_mean secular_median
           <dbl>            <dbl>        <dbl>        <dbl>          <dbl>
1           211.             168.         115.         50.8           50.5
# ℹ 7 more variables: secular_sd <dbl>, gender_threat_mean <dbl>,
#   gender_threat_median <dbl>, gender_threat_sd <dbl>,
#   political_explicit_mean <dbl>, political_explicit_median <dbl>,
#   political_explicit_sd <dbl>

# A tibble: 1 × 3
  n_periods median_entries min_entries
      <int>          <int>       <int>
1        31             52           2

═══ Period model: religious ═══

t test of coefficients:

                       Estimate Std. Error t value  Pr(>|t|)    
(Intercept)            276.5345    31.5134  8.7751 < 2.2e-16 ***
periodban_to_election -100.9875    25.4180 -3.9731 7.301e-05 ***
periodpost_election   -145.1400    21.8178 -6.6524 3.550e-11 ***
log_tokens               3.1549     6.6719  0.4729    0.6364    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

═══ Period model: secular ═══

t test of coefficients:

                      Estimate Std. Error t value  Pr(>|t|)    
(Intercept)            60.3877    12.2129  4.9446 8.151e-07 ***
periodban_to_election -25.5136    10.0714 -2.5333  0.011363 *  
periodpost_election   -27.3110     8.7740 -3.1127  0.001875 ** 
log_tokens              3.1536     2.6518  1.1893  0.234456    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

═══ Period model: political ═══

t test of coefficients:

                      Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           128.3845    29.1857  4.3989 1.135e-05 ***
periodban_to_election  28.3065    22.6824  1.2480    0.2122    
periodpost_election   -10.0222    18.4388 -0.5435    0.5868    
log_tokens              6.5317     6.4673  1.0099    0.3126    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

═══ Period model: theta ═══

t test of coefficients:

                        Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           -0.3101731  0.0839916 -3.6929 0.0002266 ***
periodban_to_election  0.0078865  0.0645818  0.1221 0.9028170    
periodpost_election    0.4269244  0.0507189  8.4175 < 2.2e-16 ***
log_tokens             0.0111695  0.0206205  0.5417 0.5880945    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Political explicit shows no significant period effects, which is itself interpretable. Direct political references (party names, election vocabulary, Erdoğan) do not change significantly across periods. This suggests the explicitly political vocabulary was a stable undercurrent throughout rather than something triggered by events.

Religious framing is the most striking finding. The intercept of about 277 is the pre-ban baseline (for an entry with log_tokens = 0): religious vocabulary was already heavily present from the start. It then drops significantly in the ban-to-election period (−101, p < 0.001) and drops even further post-election (−145, p < 0.001). This is counterintuitive at first glance, since one might have expected religious framing to increase around a politically charged election. The more likely interpretation is that the controversy leading up to the ban produced a burst of religious discourse that is already captured in the pre-ban baseline, and that commenters shifted toward other registers as the show continued. Alternatively, the pre-ban period may have had a smaller but more politically engaged audience who wrote in explicitly religious terms, while the post-election audience broadened and diluted that signal.

Secular framing tells a consistent and theoretically meaningful story alongside religious framing. It also declines significantly in both the ban-to-election period (−25, p = 0.011) and the post-election period (−27, p = 0.002) relative to pre-ban. Crucially, both religious and secular framing decline over time, which suggests the discourse is not simply shifting from one pole to the other but moving away from explicit ideological vocabulary altogether, possibly as the show becomes more mainstream.

QUESTIONS FOR LISEL

  • In Section “Key Dates in Our Data” – Any suggestions for key dates you want to look at? Also, I primarily focus on 2023 in this analysis. Stretching it to more recent years could convolute the results and might not be helpful.

  • In Section “Nursema Effect” – I particularly looked at nursema keyword, but is there any other keyword we should be looking at?

  • In Section “KWIC Analysis for Keywords” – any suggestions for keywords or subjects that you want to look at?

  • In Section “Inferential Testing” – any suggestions on whether this is useful?