Violence Against Women in Brazil

Author

Aline Mayrink

Activists in Rio de Janeiro highlight the problem of violence against women
Source: The Guardian (https://theguardian.com), Photograph: Mario Tama/Getty Images

Introduction

Violence against women is a persistent and alarming public health and human rights issue worldwide. Understanding its patterns, frequency, and the demographic groups most affected is essential to designing effective policies, prevention campaigns and support networks.

The Sistema de Informação de Agravos de Notificação (SINAN) compiles reports of violence from across Brazil, offering a vital lens through which to understand the scope and characteristics of this epidemic.

I chose to work with the “SINAN Violência 2017–2019” dataset available on Kaggle. The dataset includes anonymized records of reported violence in Brazil from 2017 to 2019.

This dataset is particularly meaningful to me as a Brazilian woman and mother. Gender-based violence affects millions of lives, often remaining underreported due to social stigma or systemic barriers. As someone passionate about communication and social justice, I want to use data science not only to uncover patterns in this violence but also to advocate for better protective and preventive measures. Through visualization and statistical analysis, I hope to illuminate the conditions and demographics most vulnerable to such violence.

Variables

The dataset contains a wide variety of variables, including:

ID: a unique identifier for each report (categorical, nominal).
DT_OCOR: date of the incident (quantitative, date/time).
CS_SEXO, CS_RACA, NU_IDADE_N: demographic info on the victim (sex, race/color, and age as named in the loaded file; categorical and quantitative).
UF_OCOR, MUN_OCOR: location of occurrence (categorical).
TRABALHADOR, ESCOLARIDADE: socioeconomic characteristics (categorical).
VIOL_FISICA, VIOL_PSIQUICA, VIOL_SEXUAL: types of violence experienced (categorical, binary flags).
AUTO_PROV, OUTRAS_PESSOAS: identifies the perpetrator (categorical).
Many others regarding circumstances and consequences of the incidents.

Data Source

The dataset was retrieved from Kaggle: SINAN - Sistema de Informação de Agravos de Notificação (https://www.kaggle.com/datasets/tissianarosa/sinan-violencia-2017-2019).

Background Research

According to the World Health Organization (WHO) and UN Women, Brazil ranks among the highest in Latin America for femicide and gender-based violence. A 2020 report from the Brazilian Public Safety Forum revealed that a woman is assaulted every two minutes in the country. The situation is often exacerbated by underreporting, limited access to support services, and cultural stigma. Policies like the Maria da Penha Law, enacted in 2006, have been critical, but enforcement and awareness remain inconsistent across states.

The SINAN data captures only reported incidents—real figures may be much higher. Still, these records provide valuable insights into demographic, geographic, and temporal patterns of violence, offering a starting point for more nuanced inquiry and intervention.

Data Analysis

Load Libraries

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(lubridate) # date handling (birthdate column)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(ggfortify)
library(viridis)

Loading required package: viridisLite

Load datasets

setwd("~/Desktop/DATA/Data Visualization 110/Project2_2")

sinan_data <- read_csv("SINAN-VIOL-2017-2019.csv")

Rows: 1063056 Columns: 161
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (15): level_0, ID_AGRAVO, ID_UNIDADE, CS_SEXO, CS_ESCOL_N, ID_OCUPA_N,...
dbl  (105): level_1, level_2, TP_NOT, SEM_NOT, NU_ANO, SG_UF_NOT, ID_MUNICIP...
lgl   (36): TP_UNI_EXT, NM_UNI_EXT, CO_UNI_EXT, NDUPLIC, DT_INVEST, CONS_ABO...
date   (4): DT_NOTIFIC, DT_OCOR, DT_NASC, DT_ENCERRA
time   (1): HORA_OCOR

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(sinan_data)

# A tibble: 6 × 161
  level_0 level_1 level_2 TP_NOT ID_AGRAVO DT_NOTIFIC SEM_NOT NU_ANO SG_UF_NOT
  <chr>     <dbl>   <dbl>  <dbl> <chr>     <date>       <dbl>  <dbl>     <dbl>
1 AC         2017       0      2 Y09       2017-01-01  201701   2017        12
2 AC         2017       1      2 Y09       2017-01-02  201701   2017        12
3 AC         2017       2      2 Y09       2017-01-02  201701   2017        12
4 AC         2017       3      2 Y09       2017-01-02  201701   2017        12
5 AC         2017       4      2 Y09       2017-01-02  201701   2017        12
6 AC         2017       5      2 Y09       2017-01-02  201701   2017        12
# ℹ 152 more variables: ID_MUNICIP <dbl>, TP_UNI_EXT <lgl>, NM_UNI_EXT <lgl>,
#   CO_UNI_EXT <lgl>, ID_UNIDADE <chr>, ID_REGIONA <dbl>, ID_RG_RESI <dbl>,
#   DT_OCOR <date>, SEM_PRI <dbl>, DT_NASC <date>, NU_IDADE_N <dbl>,
#   CS_SEXO <chr>, CS_GESTANT <dbl>, CS_RACA <dbl>, CS_ESCOL_N <chr>,
#   SG_UF <dbl>, ID_MN_RESI <dbl>, ID_PAIS <dbl>, NDUPLIC <lgl>,
#   DT_INVEST <lgl>, ID_OCUPA_N <chr>, SIT_CONJUG <dbl>, DEF_TRANS <dbl>,
#   DEF_FISICA <dbl>, DEF_MENTAL <dbl>, DEF_VISUAL <dbl>, DEF_AUDITI <dbl>, …

Data Cleaning

colnames(sinan_data) <- tolower(colnames(sinan_data)) #lower case

# extract the year in which the violence occurred
sinan_data$dt_ocor <- as.Date(sinan_data$dt_ocor)
sinan_data$year_ocor <- format(sinan_data$dt_ocor, "%Y")

head(sinan_data[, c("dt_ocor", "year_ocor")])

# A tibble: 6 × 2
  dt_ocor    year_ocor
  <date>     <chr>    
1 2016-06-15 2016     
2 2016-12-31 2016     
3 2017-01-02 2017     
4 2016-12-31 2016     
5 2017-01-02 2017     
6 2017-01-01 2017
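
The preview above shows incidents that occurred in 2016 inside a file of 2017–2019 notifications, and year_ocor is stored as text because format() returns character. A minimal sketch, on toy dates copied from that preview rather than the full file, of keeping the year as an integer and counting records per year of occurrence:

```r
# Toy dates copied from the preview above; not the full dataset
dt_ocor <- as.Date(c("2016-06-15", "2016-12-31", "2017-01-02",
                     "2016-12-31", "2017-01-02", "2017-01-01"))

# format() returns character, so convert explicitly to integer
year_ocor <- as.integer(format(dt_ocor, "%Y"))

# Count records per year of occurrence
table(year_ocor)
```

With an integer year, comparisons such as year_ocor >= 2017 work without relying on implicit type coercion.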

#calculate age in years
sinan_data$dt_nasc <- as.Date(sinan_data$dt_nasc) # convert birthdate to Date class
sinan_data$idade_calc <- as.numeric(difftime(Sys.Date(), sinan_data$dt_nasc, units = "days")) %/% 365 # approximate age in whole years as of today's date (not at the time of the incident)

head(sinan_data[, c("dt_nasc", "idade_calc")])

# A tibble: 6 × 2
  dt_nasc    idade_calc
  <date>          <dbl>
1 2002-05-03         22
2 1975-10-29         49
3 1980-04-22         45
4 1988-04-01         37
5 1971-08-06         53
6 1985-06-12         39
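
The ages above are measured against Sys.Date(), so they describe how old each person is on the day the report is rendered rather than at the time of the incident. A minimal base-R sketch of computing age at the incident instead; age_at_incident() is a hypothetical helper written for this illustration, and the toy dates are the first two rows of the previews above:

```r
# Hypothetical helper: completed years between birth and incident
age_at_incident <- function(birth, incident) {
  b <- as.POSIXlt(birth)
  i <- as.POSIXlt(incident)
  years <- i$year - b$year
  # Has the birthday already happened in the incident year?
  had_birthday <- (i$mon > b$mon) | (i$mon == b$mon & i$mday >= b$mday)
  years - ifelse(had_birthday, 0L, 1L)
}

# Toy rows; column names mirror the cleaned dataset
toy <- data.frame(
  dt_nasc = as.Date(c("2002-05-03", "1975-10-29")),
  dt_ocor = as.Date(c("2016-06-15", "2016-12-31"))
)

toy$idade_ocor <- age_at_incident(toy$dt_nasc, toy$dt_ocor)
toy$idade_ocor
```

Unlike dividing a day count by 365, this approach compares calendar dates directly, so leap years do not shift the result.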

#summarize average age by gender
sinan_data %>%
  group_by(cs_sexo) %>%
  summarize(avg_idade = mean(idade_calc, na.rm = TRUE))

# A tibble: 3 × 2
  cs_sexo avg_idade
  <chr>       <dbl>
1 F            34.7
2 I            28.7
3 M            32.4
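
The summary above shows three sex codes (F, I, M), and the later plots count all of them unless the data is filtered first. A minimal sketch, on a toy data frame with the same cs_sexo column, of restricting the records to female victims; the object names toy and women_only are illustrative only:

```r
# Toy records; values are made up for illustration
toy <- data.frame(
  cs_sexo    = c("F", "M", "F", "I", NA),
  idade_calc = c(34, 32, 21, 28, 40)
)

# Keep only rows recorded as female; which() drops NA comparisons
# dplyr equivalent: toy %>% filter(cs_sexo == "F")
women_only <- toy[which(toy$cs_sexo == "F"), ]

nrow(women_only)
mean(women_only$idade_calc)
```

Applying the same filter to sinan_data before the grouping steps would make the counts match the "Violence Against Women" titles used in the plots.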

# convert CS_RACA to factor with appropriate labels
sinan_data$cs_raca <- as.factor(sinan_data$cs_raca)

# Recode using dplyr::recode
sinan_data$cs_raca <- dplyr::recode(sinan_data$cs_raca,
  `1` = "White",
  `2` = "Black",
  `3` = "Asian",
  `4` = "Brown",
  `5` = "Indigenous",
  `9` = "Ignored"
)

# View counts
table(sinan_data$cs_raca)


     White      Black      Asian      Brown Indigenous    Ignored 
    430261      83419       7550     428869       9963      97183
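
Raw counts are easier to compare as shares. A minimal sketch using only the counts printed above; it also estimates how many rows carry no race code at all, assuming table() was called with its default behavior of dropping NA values:

```r
# Counts copied from the table above
race_counts <- c(White = 430261, Black = 83419, Asian = 7550,
                 Brown = 428869, Indigenous = 9963, Ignored = 97183)

# Percentage share of each category among records with a race code
race_share <- round(100 * prop.table(race_counts), 1)
sort(race_share, decreasing = TRUE)

# Rows with no race code (NA), given the 1,063,056 rows in the file
1063056 - sum(race_counts)
```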

Linear Regression Analysis

Model Equation: idade_calc = β0 + β1 * dt_ocor + ε

lm_model <- lm(idade_calc ~ dt_ocor, data = sinan_data)

autoplot(lm_model, ncol = 2) +
  theme_minimal() +
  labs(title = "Linear Regression Diagnostic Plots",
       caption = "Model: idade_calc ~ dt_ocor | Source: SINAN (2016–2019)")
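
Because R stores a Date as the number of days since 1970-01-01, the slope β1 in this model is a change in mean age per day, which is hard to read directly. A minimal sketch on simulated data (not the SINAN records; the seed, intercept and slope are arbitrary assumptions) of converting the slope to a per-year figure:

```r
set.seed(110)  # arbitrary seed, for reproducibility only

# Simulated stand-in for the real data: one record per day
dt_ocor <- seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "day")
idade   <- -4 + 0.002 * as.numeric(dt_ocor) + rnorm(length(dt_ocor), sd = 1)

# Make the days-since-1970 scale explicit in the formula
fit <- lm(idade ~ as.numeric(dt_ocor))

# Slope per day, then rescaled to an average calendar year
slope_per_day  <- unname(coef(fit)[2])
slope_per_year <- slope_per_day * 365.25
c(per_day = slope_per_day, per_year = slope_per_year)
```

Reporting the per-year slope alongside the diagnostic plots makes it easier to judge whether any trend in victim age is practically meaningful.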

Visual Exploration

#example 1
yearly_counts <- sinan_data %>%
  filter(!is.na(year_ocor)) %>%
  group_by(year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")

filtered_year_data <- yearly_counts %>%
  filter(year_ocor %in% c(2017, 2018, 2019))

ggplot(filtered_year_data, aes(x = year_ocor, y = total_casos, fill = year_ocor)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Number of Cases by Year (2017 - 2019)",
    x = "Year",
    y = "Total Number of Cases",
    caption = "Source: SINAN - Sistema de Informação de Agravos de Notificação"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

#example 2
filtered_year_data_age <- sinan_data %>%
  filter(!is.na(idade_calc), !is.na(year_ocor)) %>%
  group_by(idade_calc, year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")

# Then: Filter for the years 2017–2019
filtered_year_data_age <- filtered_year_data_age %>%
  filter(year_ocor %in% c(2017, 2018, 2019))

# View the first few rows
head(filtered_year_data_age)

# A tibble: 6 × 3
  idade_calc year_ocor total_casos
       <dbl> <chr>           <int>
1          5 2019             3110
2          6 2018             3463
3          6 2019             7901
4          7 2017             3750
5          7 2018             7702
6          7 2019             6331

filtered_facet_data <- sinan_data %>%
  filter(!is.na(idade_calc), !is.na(year_ocor)) %>%
  filter(year_ocor %in% c(2017, 2018, 2019), idade_calc >= 0, idade_calc <= 100) %>%
  group_by(idade_calc, year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")

# Plot with facet_wrap by year
ggplot(filtered_facet_data, aes(x = idade_calc, y = total_casos)) +
  geom_col(fill = "pink") +
  facet_wrap(~ year_ocor) +
  labs(
    title = "Violence Against Women by Age (0–100 years)",
    x = "Age (Years)",
    y = "Total Number of Cases",
    caption = "Source: SINAN - Sistema de Informação de Agravos de Notificação"
  ) +
  theme_minimal()
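
Single-year bars can be hard to compare across facets, so grouping ages into bands is one way to summarize them. A minimal sketch on toy ages (made-up values, not the SINAN records) that builds 10-year bands with cut() and computes the share of records in the 10–29 range:

```r
# Toy ages; in the report these would come from idade_calc
idade <- c(4, 12, 19, 25, 29, 30, 47, 63, 88)

# Left-closed 10-year bands: [0,10), [10,20), ..., [90,100]
breaks <- seq(0, 100, by = 10)
faixa  <- cut(idade, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Number of records per age band
table(faixa)

# Share of records that fall in the 10-29 range
mean(idade >= 10 & idade <= 29)
```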

#example 3
filtered_color <- sinan_data %>%
  filter(!is.na(cs_raca), !is.na(year_ocor)) %>%
  filter(year_ocor %in% c(2017, 2018, 2019), idade_calc >= 0, idade_calc <= 100) %>%
  group_by(cs_raca, year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop") 

ggplot(filtered_color, aes(x = reorder(cs_raca, -total_casos), y = total_casos, fill = cs_raca)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Violence Cases by Ethnicity (2017 - 2019)",
    x = "Ethnicity",
    y = "Total Number of Cases",
    caption = "Source: SINAN - Sistema de Informação de Agravos de Notificação"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

Final Visualization

final_plot <- sinan_data %>%
  # drop incomplete rows and keep the study window before grouping
  filter(!is.na(year_ocor), !is.na(idade_calc), !is.na(level_0)) %>%
  filter(year_ocor %in% c(2017, 2018, 2019), idade_calc >= 0, idade_calc <= 100) %>%
  group_by(year_ocor, level_0) %>%
  summarise(
    mean_idade = mean(idade_calc, na.rm = TRUE),
    total_casos = n(),
    .groups = "drop"
  )
print(final_plot)

# A tibble: 81 × 4
   year_ocor level_0 mean_idade total_casos
   <chr>     <chr>        <dbl>       <int>
 1 2017      AC            30.4        2051
 2 2017      AL            34.9        3457
 3 2017      AM            27.6        4311
 4 2017      AP            31.8         503
 5 2017      BA            35.8        9631
 6 2017      CE            33.5        5505
 7 2017      DF            29.9        2993
 8 2017      ES            37.1        7074
 9 2017      GO            33.3        6549
10 2017      MA            32.0        2280
# ℹ 71 more rows

# get unique states and generate viridis colors
unique_states <- unique(final_plot$level_0)
num_states <- length(unique_states)
viridis_colors <- viridis(num_states)
names(viridis_colors) <- unique_states

plot_ly(
  data = final_plot,
  x = ~year_ocor,
  y = ~total_casos,
  type = 'scatter',
  mode = 'markers',
  color = ~level_0,
  colors = viridis_colors,
  size = ~mean_idade,
  text = ~paste(
    "<b>Year:</b>", year_ocor,
    "<br><b>State:</b>", level_0,
    "<br><b>Age (Years) Mean:</b>", round(mean_idade, 1),
    "<br><b>Total Number of Cases:</b>", total_casos
  ),
  marker = list(
    sizemode = 'diameter',
    sizeref = 2,
    line = list(width = 1)  # explicit marker outline width (added to silence a repeated warning)
  ),
  hoverinfo = 'text'
) %>%
  layout(
    title = "Violence Against Women in Brazil by State (2017 - 2019)",
    xaxis = list(title = "Year"),
    yaxis = list(title = "Total Number of Cases"),
    showlegend = TRUE
  )

Interpretation of the Visualization

My final visualization explores both temporal and demographic trends in reported violence against women in Brazil. The interactive plot shows the number of reported incidents per year for each state (level_0), with marker size reflecting the mean age of the people involved. This approach enables a more nuanced understanding of how violence against women varies by location and demographic factors.

Several striking patterns emerged during this analysis. One of the most prominent observations was the consistent underreporting or lack of data in certain regions, particularly in the North and interior states of Brazil. This may reflect infrastructural disparities, cultural stigma, or systemic neglect in those areas. Furthermore, it became evident that younger women, particularly those in the 10–29 age range, experience a disproportionately high share of reported physical violence. Many incidents within this age group involve intimate partners or family members, underscoring the central role of domestic violence in this issue.

This project prompted me to reflect on both the power and the limits of data in addressing social issues. While data visualization can reveal trends and disparities, it cannot capture the full emotional or cultural context behind each case. One limitation I encountered was missing values in critical variables and incomplete data for some states. I also wished I could have included GIS-based maps to show geographic patterns more clearly, but I struggled to align shapefiles with the LEVEL_0 codes during this phase.
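
One possible route to the shapefile alignment, sketched here under assumptions: level_0 holds two-letter state abbreviations, while many Brazilian state shapefiles are keyed by the two-digit IBGE state code (the first rows of the data pair AC with SG_UF_NOT = 12, which matches that scheme). The column name code_uf is hypothetical; the actual key name depends on the shapefile used.

```r
# Two-letter state abbreviations (as in level_0) and their IBGE codes
uf_lookup <- data.frame(
  level_0 = c("RO", "AC", "AM", "RR", "PA", "AP", "TO", "MA", "PI",
              "CE", "RN", "PB", "PE", "AL", "SE", "BA", "MG", "ES",
              "RJ", "SP", "PR", "SC", "RS", "MS", "MT", "GO", "DF"),
  code_uf = c(11, 12, 13, 14, 15, 16, 17, 21, 22,
              23, 24, 25, 26, 27, 28, 29, 31, 32,
              33, 35, 41, 42, 43, 50, 51, 52, 53)
)

# Toy summary rows copied from the final_plot preview
toy <- data.frame(level_0 = c("AC", "AL", "AM"),
                  total_casos = c(2051, 3457, 4311))

# Attach the numeric code so the rows can be joined to map polygons
merged <- merge(toy, uf_lookup, by = "level_0", all.x = TRUE)
merged
```

Any state left with a missing code_uf after the merge would point to an abbreviation that does not match the lookup.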

Despite these challenges, the project was a meaningful and impactful experience. It allowed me to connect data science tools with real-world applications that have significant consequences for people’s lives. It also reinforced the importance of data literacy and ethical storytelling when working with sensitive topics like gender-based violence.

Bibliography

Fórum Brasileiro de Segurança Pública (2020). Anuário Brasileiro de Segurança Pública. Retrieved from: https://forumseguranca.org.br/
WHO (2021). Violence Against Women Prevalence Estimates, 2018. World Health Organization. Retrieved from: https://www.who.int/publications/i/item/9789240022256
UN Women Brazil. (2023). Facts and Figures: Ending Violence Against Women. Retrieved from: https://www.onumulheres.org.br

Coding Sources:

AI Assistance: OpenAI ChatGPT used for code explanation (April 2025).
Lubridate Library - https://library.virginia.edu/data/articles/working-with-dates-and-time-in-r-using-the-lubridate-package