Source: The Guardian (https://theguardian.com), Photograph: Mario Tama/Getty Images
Introduction
Violence against women is a persistent and alarming public health and human rights issue worldwide. Understanding its patterns, frequency, and the demographic groups most affected is essential to designing effective policies, prevention campaigns and support networks.
The Sistema de Informação de Agravos de Notificação (SINAN) compiles reports of violence from across Brazil, offering a vital lens through which to understand the scope and characteristics of this epidemic.
I chose to work with the “SINAN Violência 2017–2019” dataset available on Kaggle. The dataset includes anonymized records of reported violence in Brazil from 2017 to 2019.
This dataset is particularly meaningful to me as a Brazilian woman and mother. Gender-based violence affects millions of lives, often remaining underreported due to social stigma or systemic barriers. As someone passionate about communication and social justice, I want to use data science not only to uncover patterns in this violence but also to advocate for better protective and preventive measures. Through visualization and statistical analysis, I hope to illuminate the conditions and demographics most vulnerable to such violence.
Variables
The dataset contains a wide variety of variables, including:
ID: a unique identifier for each report (categorical, nominal).
DT_OCOR: date of the incident (quantitative, date/time).
SEXO, RACA_COR, IDADE: demographic info on the victim (categorical and quantitative).
UF_OCOR, MUN_OCOR: location of occurrence (categorical).
According to the World Health Organization (WHO) and UN Women, Brazil ranks among the highest in Latin America for femicide and gender-based violence. A 2020 report from the Brazilian Public Safety Forum revealed that a woman is assaulted every two minutes in the country. The situation is often exacerbated by underreporting, limited access to support services, and cultural stigma. Policies like the Maria da Penha Law, enacted in 2006, have been critical, but enforcement and awareness remain inconsistent across states.
The SINAN data captures only reported incidents, so the real figures may be much higher. Still, these records provide valuable insights into demographic, geographic, and temporal patterns of violence, offering a starting point for more nuanced inquiry and intervention.
Data Analysis
Load Libraries
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(plotly)
library(highcharter)
library(lubridate) # data handling - convert birthdate column
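The code that follows uses `sinan_data` without showing how it was read in. A minimal sketch of that step is given here, under stated assumptions: the file location is hypothetical, and a two-row toy CSV with made-up values is written to a temporary file so the example runs anywhere. The upper-case column names are inferred from the lower-cased names used later in the analysis.

```r
library(readr)

# Toy stand-in for the Kaggle export (made-up rows, not real records).
# Replace `csv_path` with the location of the real file; that path is not
# given in this report, so none is assumed here.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "DT_OCOR,DT_NASC,CS_RACA,LEVEL_0",
  "2017-03-14,1990-06-01,Parda,SP",
  "2018-11-02,2001-01-20,Branca,BA"
), csv_path)

# Read every column as text first; the date columns are converted
# explicitly with as.Date() in the steps that follow.
sinan_data <- read_csv(csv_path, col_types = cols(.default = col_character()))
```

Reading everything as character avoids silent type guessing on coded fields; whether that suits the real export is a judgment call.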
colnames(sinan_data) <- tolower(colnames(sinan_data)) # lower case
# change the date when the violence occurred to the year only
sinan_data$dt_ocor <- as.Date(sinan_data$dt_ocor)
sinan_data$year_ocor <- format(sinan_data$dt_ocor, "%Y")
head(sinan_data[, c("dt_ocor", "year_ocor")])
# calculate age in years
sinan_data$dt_nasc <- as.Date(sinan_data$dt_nasc) # convert birthdate to Date class
# age in years as of today's date (note: this is not the age at the time of the incident)
sinan_data$idade_calc <- as.numeric(difftime(Sys.Date(), sinan_data$dt_nasc, units = "days")) %/% 365
head(sinan_data[, c("dt_nasc", "idade_calc")])
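Because the age above is measured against today's date, every victim appears several years older than she was when the incident was reported. A possible alternative, sketched here on made-up toy rows, measures age at the date of occurrence instead. The column name `idade_ocor` is hypothetical, and this is not the calculation behind the figures in this report.

```r
library(lubridate)

# Toy rows (made up) standing in for sinan_data
toy <- data.frame(
  dt_nasc = as.Date(c("1990-06-01", "2001-01-20")),
  dt_ocor = as.Date(c("2017-03-14", "2018-11-02"))
)

# Completed years between birth and incident; working on the calendar
# interval handles leap years, unlike dividing a day count by 365.
toy$idade_ocor <- interval(toy$dt_nasc, toy$dt_ocor) %/% years(1)
toy
```

Switching to this definition would change the age distributions and the mean ages reported later, so it would need to be applied consistently.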
# example 1
yearly_counts <- sinan_data %>%
  filter(!is.na(year_ocor)) %>%
  group_by(year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")
filtered_year_data <- yearly_counts %>%
  filter(year_ocor %in% c(2017, 2018, 2019))
ggplot(filtered_year_data, aes(x = year_ocor, y = total_casos, fill = year_ocor)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Number of Cases by Year (2017 - 2019)",
    x = "Year",
    y = "Total Number of Cases",
    caption = "Source: SINAN - Sistema de Informação de Agravos de Notificação"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
# example 2
filtered_year_data_age <- sinan_data %>%
  filter(!is.na(idade_calc), !is.na(year_ocor)) %>%
  group_by(idade_calc, year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")
# Then: filter for the years 2017–2019
filtered_year_data_age <- filtered_year_data_age %>%
  filter(year_ocor %in% c(2017, 2018, 2019))
# View the first few rows
head(filtered_year_data_age)
# Step 1: Count cases by age and year, keeping ages 0 to 100
filtered_facet_data <- sinan_data %>%
  filter(!is.na(idade_calc), !is.na(year_ocor)) %>%
  filter(year_ocor %in% c(2017, 2018, 2019), idade_calc >= 0, idade_calc <= 100) %>%
  group_by(idade_calc, year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")
# Step 2: Plot with facet_wrap by year
ggplot(filtered_facet_data, aes(x = idade_calc, y = total_casos)) +
  geom_col(fill = "pink") +
  facet_wrap(~ year_ocor) +
  labs(
    title = "Violence Against Women by Age (0–100 years)",
    x = "Age (Years)",
    y = "Total Number of Cases",
    caption = "Source: SINAN - Sistema de Informação de Agravos de Notificação"
  ) +
  theme_minimal()
# example 3
filtered_color <- sinan_data %>%
  filter(!is.na(cs_raca), !is.na(year_ocor)) %>%
  filter(year_ocor %in% c(2017, 2018, 2019), idade_calc >= 0, idade_calc <= 100) %>%
  group_by(cs_raca, year_ocor) %>%
  summarise(total_casos = n(), .groups = "drop")
ggplot(filtered_color, aes(x = reorder(cs_raca, -total_casos), y = total_casos, fill = cs_raca)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Violence Cases by Ethnicity (2017 - 2019)",
    x = "Ethnicity",
    y = "Total Number of Cases",
    caption = "Source: SINAN - Sistema de Informação de Agravos de Notificação"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
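The table printed next shows a data frame called `final_plot`, but the code that builds it does not appear in this report. Judging only from its columns (`year_ocor`, `level_0`, `mean_idade`, `total_casos`), it could have been produced by a grouped summary along these lines. This is a reconstruction under that assumption, run here on made-up toy rows, and may differ from the original step.

```r
library(dplyr)

# Toy rows (made up) with the columns the summary needs
toy <- tibble(
  year_ocor  = c("2017", "2017", "2017", "2018"),
  level_0    = c("AC", "AC", "BA", "AC"),
  idade_calc = c(20, 40, 35, 30)
)

# One row per year and state: mean age and number of reported cases
final_plot <- toy %>%
  filter(year_ocor %in% c("2017", "2018", "2019"), !is.na(level_0)) %>%
  group_by(year_ocor, level_0) %>%
  summarise(
    mean_idade  = mean(idade_calc, na.rm = TRUE),
    total_casos = n(),
    .groups = "drop"
  )
final_plot
```

With the real data, 27 federative units over three years would give the 81 rows reported in the table.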
# A tibble: 81 × 4
   year_ocor level_0 mean_idade total_casos
   <chr>     <chr>        <dbl>       <int>
 1 2017      AC            30.4        2051
 2 2017      AL            34.9        3457
 3 2017      AM            27.6        4311
 4 2017      AP            31.8         503
 5 2017      BA            35.8        9631
 6 2017      CE            33.5        5505
 7 2017      DF            29.9        2993
 8 2017      ES            37.1        7074
 9 2017      GO            33.3        6549
10 2017      MA            32.0        2280
# ℹ 71 more rows
# get unique states and generate viridis colors
unique_states <- unique(final_plot$level_0)
num_states <- length(unique_states)
# namespace-qualified because library(viridis) is not loaded above
viridis_colors <- viridis::viridis(num_states)
names(viridis_colors) <- unique_states
plot_ly(
  data = final_plot,
  x = ~year_ocor,
  y = ~total_casos,
  type = 'scatter',
  mode = 'markers',
  color = ~level_0,
  colors = viridis_colors,
  size = ~mean_idade,
  text = ~paste(
    "<b>Year:</b>", year_ocor,
    "<br><b>State:</b>", level_0,
    "<br><b>Age (Years) Mean:</b>", round(mean_idade, 1),
    "<br><b>Total Number of Cases:</b>", total_casos
  ),
  marker = list(
    sizemode = 'diameter',
    sizeref = 2,
    line = list(width = 1) # explicit outline width; fix for the repeated warning
  ),
  hoverinfo = 'text'
) %>%
  layout(
    title = "Violence Against Women in Brazil by State (2017 - 2019)",
    xaxis = list(title = "Year"),
    yaxis = list(title = "Total Number of Cases"),
    showlegend = TRUE
  )
Interpretation of the Visualization
My final visualization explores both temporal and demographic trends in reported violence against women in Brazil. The interactive plot shows the number of reported incidents per year for each state (level_0), with marker size reflecting the mean age of the victims in that state and year. This approach enables a more nuanced understanding of how violence against women varies by location and demographic factors.
Several striking patterns emerged during this analysis. One of the most prominent observations was the consistent underreporting or lack of data in certain regions, particularly in the North and interior states of Brazil. This may reflect infrastructural disparities, cultural stigma, or systemic neglect in those areas. Furthermore, it became evident that younger women, particularly those in the 10–29 age range, experience a disproportionately high share of reported physical violence. Many incidents within this age group involve intimate partners or family members, underscoring the central role of domestic violence in this issue.
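One way to check an age-band claim like this against the data is to bin ages and compute each band's share of reported cases. The sketch below does so on made-up toy ages. The band boundaries and the names `faixa` and `share` are illustrative choices rather than part of the original analysis, and it covers only the age distribution, not the type of violence or the relationship to the aggressor.

```r
library(dplyr)

# Toy ages (made up) standing in for sinan_data$idade_calc
toy <- tibble(idade_calc = c(8, 15, 22, 27, 34, 45, 61))

age_share <- toy %>%
  mutate(faixa = cut(
    idade_calc,
    breaks = c(0, 10, 30, 60, Inf),
    right  = FALSE, # bands are [0,10), [10,30), [30,60), [60,Inf)
    labels = c("0-9", "10-29", "30-59", "60+")
  )) %>%
  count(faixa, name = "total_casos") %>%
  mutate(share = total_casos / sum(total_casos))
age_share
```

A share is only "disproportionate" relative to a baseline, so a fuller check would compare these shares with the age structure of the female population.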
This project prompted me to reflect on both the power and the limits of data in addressing social issues. While data visualization can reveal trends and disparities, it cannot capture the full emotional or cultural context behind each case. One limitation I encountered was missing values in critical variables, and incomplete data for some states. I also wished I could have included GIS-based maps to show geographic patterns more clearly, but I struggled to align shapefiles with the LEVEL_0 codes during this phase.
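The extent of the missing-data problem can be quantified before any plotting. The sketch below, on made-up toy rows, computes the share of missing values per column and the share of records without a birthdate per state. The column names follow those used earlier in this analysis, while `missing_by_column`, `missing_by_state` and `pct_sem_nasc` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Toy rows (made up) with some gaps
toy <- tibble(
  level_0 = c("AC", "AC", "BA", "BA"),
  dt_nasc = as.Date(c("1990-06-01", NA, NA, "2001-01-20")),
  cs_raca = c("Parda", NA, "Branca", "Preta")
)

# Share of missing values in each column
missing_by_column <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

# Share of records per state that lack a birthdate
missing_by_state <- toy %>%
  group_by(level_0) %>%
  summarise(pct_sem_nasc = mean(is.na(dt_nasc)), .groups = "drop")

missing_by_column
missing_by_state
```

Reporting these shares alongside each chart would make clear how many records were dropped by the `!is.na()` filters.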
Despite these challenges, the project was a meaningful and impactful experience. It allowed me to connect data science tools with real-world applications that have significant consequences for people’s lives. It also reinforced the importance of data literacy and ethical storytelling when working with sensitive topics like gender-based violence.
Bibliography
Fórum Brasileiro de Segurança Pública (2020). Anuário Brasileiro de Segurança Pública. Retrieved from: https://forumseguranca.org.br/
WHO (2021). Violence Against Women Prevalence Estimates, 2018. World Health Organization. Retrieved from: https://www.who.int/publications/i/item/9789240022256
UN Women Brazil (2023). Facts and Figures: Ending Violence Against Women. Retrieved from: https://www.onumulheres.org.br
Coding Sources:
AI Assistance: OpenAI ChatGPT used for code explanation (April 2025).