1 Executive summary
2 Context and objective
3 Data provenance and unit of analysis
- 3.1 Key definitions
4 Data description and diagnostics
- 4.1 Data cleaning and preprocessing
- 4.2 Data QA checks
5 Temporal and institutional coverage
6 Initial descriptive analysis
7 Temporal Patterns
8 Spatial Patterns
9 Reproducibility notes
10 Key decisions and next steps

1 Executive summary

Dataset scope: public schools in the Federal District, school-day unit, 498,677 observations from 2016-01-01 to 2019-12-31.
Quality checks: identifier standardization, date consolidation, attendance consistency checks, and coordinate parsing with deterministic imputation for a small subset of schools.
Core findings (descriptive): absenteeism rates are right-skewed with substantial cross-school heterogeneity; temporal patterns show seasonality and weekly cycles.
Key risks: unbalanced panel coverage and specialized institutions with atypical reporting require explicit sample handling.
Next steps: define analytic samples, finalize temporal and spatial summaries, and prepare for model-based analysis.

2 Context and objective

School absenteeism is a central concern in educational research and public policy, as it is closely associated with learning outcomes, student engagement, and longer-term socioeconomic trajectories. Understanding its temporal dynamics and institutional heterogeneity is particularly relevant in large urban education systems, where schools operate under diverse contextual and organizational conditions.

This report presents an initial exploratory analysis of a large administrative dataset covering daily attendance records for public schools in the Federal District of Brazil. The purpose of this document is to assess the structure, coverage, and internal consistency of the data, as well as to provide preliminary descriptive evidence on patterns of absenteeism across schools and over time.

The analysis conducted here is intentionally descriptive and diagnostic in nature. No causal claims are made, and no multivariate modeling is attempted at this stage. Instead, the focus is on identifying key data features, potential quality issues, and broad empirical regularities that will inform subsequent data-cleaning procedures and the design of future analytical models.

This exploratory step is intended to support methodological decisions in the next phases of the research, including variable construction, sample definition, and the choice of appropriate empirical strategies.

3 Data provenance and unit of analysis

The dataset is an administrative extract of daily attendance records for public schools in the Federal District of Brazil. The unit of analysis is the school-day. The table below summarizes the coverage of the raw extract, before any cleaning.

Data Coverage and Unit of Analysis
Start Date	End Date	Unique Days	Unique Schools	Total Observations
2016-01-01	2019-12-31	1,461	632	498,677

3.1 Key definitions

Absenteeism rate: total_faltas / (total_presencas + total_faltas + total_justificadas) at the school-day level.
For comparability across schools with different reporting practices, justified absences are kept in the denominator, since they remain part of the recorded attendance accounting and define total observed student-days.
For economic and epidemiological interpretations, an alternative “effective absence” measure can also be defined by treating justified absences as non-attendance: absent_effective = (total_faltas + total_justificadas) / (total_presencas + total_faltas + total_justificadas). This alternative aligns more closely with interpretations related to learning time loss and illness-related absenteeism; however, it is not explored in this report and is reserved for subsequent model-based and robustness analyses.
Sample handling distinguishes regular panel coverage from specialized or limited-coverage institutions, as detailed in later sections.
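
The two definitions above can be sketched as follows. This is a minimal illustration in Python, independent of the report's own R pipeline; the function names are hypothetical, and returning None for school-days with no recorded student-days is an assumption rather than a documented rule.

```python
def absenteeism_rate(faltas, presencas, justificadas):
    """Unjustified absences as a share of all recorded student-days."""
    total = faltas + presencas + justificadas
    if total == 0:
        return None  # rate is undefined when nothing was recorded
    return faltas / total


def effective_absence_rate(faltas, presencas, justificadas):
    """Alternative measure: justified absences also count as non-attendance."""
    total = faltas + presencas + justificadas
    if total == 0:
        return None
    return (faltas + justificadas) / total


# Counts from the first sample row of the cleaned dataset (2018-04-03).
rate = absenteeism_rate(1335, 770, 5)
effective = effective_absence_rate(1335, 770, 5)
print(round(rate, 4), round(effective, 4))  # 0.6327 0.6351
```

Both measures share the same denominator, so the effective measure can never fall below the main one; the gap between them is the share of justified absences.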

4 Data description and diagnostics

The dataset analyzed in this report consists of a large administrative panel with daily attendance information at the school level for public schools in the Federal District of Brazil. Each observation corresponds to a specific school-day combination, allowing the analysis of both temporal variation within schools and cross-sectional heterogeneity across institutions.

In total, the dataset contains 498,677 observations and 23 variables. The variables can be grouped into four broad categories: (i) school identifiers and institutional characteristics, (ii) educational and administrative classifications, (iii) attendance-related counts, and (iv) temporal and spatial information. Most institutional variables are constant within schools, while attendance measures vary daily.

From a structural perspective, the unit of observation and the panel nature of the data are well defined. However, several data-quality issues are immediately apparent and are documented below, as they have direct implications for subsequent data preparation and modeling decisions.

Dataset Structure Summary
Variable	Type	Missing
...1	numeric	0
code_school	numeric	542
name_school	character	542
education_level	character	542
education_level_others	character	151308
admin_category	character	542
address	character	542
government_level	character	542
regulated_education_council	character	542
service_restriction	character	542
size	character	542
urban	character	542
location_type	character	542
date_update	Date	542
dia_data	Date	0
cep_escola	numeric	2724
count	numeric	0
ano_nascimento	numeric	542
total_faltas	numeric	0
total_presencas	numeric	0
total_justificadas	numeric	0
data	Date	0
geom	character	0

The school identifier (code_school) is currently stored as a numeric variable. Given its role as a categorical identifier rather than a quantitative measure, this variable should be converted to character format to avoid precision loss and unintended behavior in joins or grouping operations. Additionally, an index-like variable (...1) appears to be a residual artifact from data import procedures and does not carry analytical relevance.

The dataset includes two date variables (dia_data and data), both stored as Date objects. Their coexistence suggests redundancy or inconsistent naming across data sources. A single reference date variable will need to be selected after verifying their equivalence.

A more serious issue is observed in the variable labeled ano_nascimento (year of birth), which contains non-integer values (e.g., 1996.365 in the sample rows of Section 4.1) that are incompatible with valid year formats. This strongly suggests data corruption, misclassification, or the unintended inclusion of an anonymized identifier. Given its current state, this variable is excluded from the initial exploratory analysis until further clarification is obtained.

Attendance-related variables (total_presencas, total_faltas, and total_justificadas) are well defined as numeric counts and form the analytical core of this study. These variables are conceptually consistent and vary at the school-day level, making them suitable for both descriptive analysis and subsequent modeling of absenteeism dynamics.

An additional count variable (count) appears to represent the total number of enrolled or expected students per school-day, although its precise definition will require confirmation. Together, these variables enable the construction of derived indicators such as absenteeism rates and adjusted attendance measures.

Spatial information is provided through a geometry variable (geom), currently stored as a character string containing coordinate pairs. In its present format, this variable is not directly compatible with standard spatial analysis workflows in R (e.g., sf objects). Spatial processing is therefore deferred to later stages, following appropriate parsing and validation of the coordinate information.

4.1 Data cleaning and preprocessing

Before proceeding to descriptive analyses, a structured data-cleaning and preprocessing step was conducted to establish a coherent and analytically reliable baseline. Given the administrative origin and large scale of the dataset, this process prioritized clarity of identifiers, consistency of attendance measures, consolidation of temporal information, and validation of spatial metadata.

The purpose of this section is not to document all possible future transformations, but to explicitly describe the initial decisions required to make the dataset suitable for systematic exploration and subsequent modeling. All transformations applied at this stage are transparent, limited in scope, and reversible.

First 6 rows of the cleaned dataset (Subset of columns)
code_school	name_school	education_level	education_level_others	admin_category	address	government_level	regulated_education_council	service_restriction	size	urban	location_type	dia_data	cep_escola	count	ano_nascimento	total_faltas	total_presencas	total_justificadas	data	geom	total_attendance	longitude	latitude
53000234	CEJA ASA SUL - CESAS	Educação de Jovens Adultos	Atendimento Educacional Especializado	Pública	QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF.	Estadual	Sim	ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO	Mais de 1000 matrículas de escolarização	Urbana	A escola não está em área de localização diferenciada	2018-04-03	70200620	2110	1996.365	1335	770	5	2018-04-03	c(-47.8821157, -15.8094245)	2110	-47.88212	-15.80942
53000234	CEJA ASA SUL - CESAS	Educação de Jovens Adultos	Atendimento Educacional Especializado	Pública	QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF.	Estadual	Sim	ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO	Mais de 1000 matrículas de escolarização	Urbana	A escola não está em área de localização diferenciada	2018-04-04	70200620	1480	1993.818	880	585	15	2018-04-04	c(-47.8821157, -15.8094245)	1480	-47.88212	-15.80942
53000234	CEJA ASA SUL - CESAS	Educação de Jovens Adultos	Atendimento Educacional Especializado	Pública	QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF.	Estadual	Sim	ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO	Mais de 1000 matrículas de escolarização	Urbana	A escola não está em área de localização diferenciada	2018-04-06	70200620	1675	1994.179	1130	535	10	2018-04-06	c(-47.8821157, -15.8094245)	1675	-47.88212	-15.80942
53000234	CEJA ASA SUL - CESAS	Educação de Jovens Adultos	Atendimento Educacional Especializado	Pública	QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF.	Estadual	Sim	ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO	Mais de 1000 matrículas de escolarização	Urbana	A escola não está em área de localização diferenciada	2018-04-10	70200620	2120	1996.387	1370	735	15	2018-04-10	c(-47.8821157, -15.8094245)	2120	-47.88212	-15.80942
53000234	CEJA ASA SUL - CESAS	Educação de Jovens Adultos	Atendimento Educacional Especializado	Pública	QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF.	Estadual	Sim	ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO	Mais de 1000 matrículas de escolarização	Urbana	A escola não está em área de localização diferenciada	2018-04-11	70200620	1490	1993.732	815	640	35	2018-04-11	c(-47.8821157, -15.8094245)	1490	-47.88212	-15.80942
53000234	CEJA ASA SUL - CESAS	Educação de Jovens Adultos	Atendimento Educacional Especializado	Pública	QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF.	Estadual	Sim	ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO	Mais de 1000 matrículas de escolarização	Urbana	A escola não está em área de localização diferenciada	2018-04-13	70200620	1715	1994.157	1130	555	30	2018-04-13	c(-47.8821157, -15.8094245)	1715	-47.88212	-15.80942

At this stage, non-informative artifacts generated during data import were removed, and key identifiers were standardized to character format to avoid numerical precision issues. A derived attendance measure was constructed to facilitate internal consistency checks between reported presences, absences, and total counts.

Spatial information was initially provided as a character string containing coordinate pairs. These values were parsed into numeric longitude and latitude variables, enabling basic validation while postponing full spatial object construction to a later analytical stage.
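
The parsing step can be illustrated with a short sketch. It is written in Python rather than the R used for the report, and it assumes that the string always has the form c(longitude, latitude) seen in the sample rows, with NA or NaN marking missing values. The function name and the range check are illustrative choices, not the report's actual code.

```python
import re

# Matches strings such as "c(-47.8821157, -15.8094245)" or "c(NaN, NaN)".
_GEOM_PATTERN = re.compile(r"^c\(\s*([^,\s]+)\s*,\s*([^)\s]+)\s*\)$")


def parse_geom(text):
    """Return (longitude, latitude) as floats, or (None, None) if missing or invalid."""
    match = _GEOM_PATTERN.match(text.strip())
    if match is None:
        return (None, None)
    values = []
    for token in match.groups():
        if token in ("NA", "NaN"):
            return (None, None)
        try:
            values.append(float(token))
        except ValueError:
            return (None, None)
    lon, lat = values
    # Basic validation: reject values outside the valid geographic range.
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        return (None, None)
    return (lon, lat)


print(parse_geom("c(-47.8821157, -15.8094245)"))
print(parse_geom("c(NaN, NaN)"))
```

Returning an explicit pair of None values keeps missing coordinates easy to count before any imputation or filtering is applied.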

Schools with Missing Coordinates (Before Imputation)
code_school	cep_escola	geom
53005341	73030050	c(NaN, NaN)
53005767	73151010	c(NaN, NaN)
53008944	72301700	c(NaN, NaN)
53008979	72669425	c(NaN, NaN)
53011066	72600412	c(NaN, NaN)
53012127	72631115	c(NaN, NaN)
53012666	73307994	c(NaN, NaN)
NA	NA	c(NA, NA)

A small subset of schools presented missing spatial coordinates after the initial parsing step. For these cases, latitude and longitude values were recovered using externally verified school-level location information.

This imputation procedure was limited to a clearly identified and small number of institutions and does not rely on statistical interpolation. Instead, it represents a deterministic correction of missing metadata based on external sources. All imputed values are explicitly documented and reproducible.
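
A deterministic correction of this kind amounts to a lookup keyed by school code. The sketch below (Python, for illustration only) uses hypothetical school codes and coordinates, not the values actually applied in the report, and adds a hypothetical coord_imputed flag so that corrected rows stay identifiable.

```python
def impute_coordinates(rows, lookup):
    """Fill missing longitude/latitude from a lookup keyed by school code."""
    filled = []
    for row in rows:
        row = dict(row)  # work on a copy so the input stays untouched
        missing = row["longitude"] is None or row["latitude"] is None
        row["coord_imputed"] = False
        if missing and row["code_school"] in lookup:
            row["longitude"], row["latitude"] = lookup[row["code_school"]]
            row["coord_imputed"] = True
        filled.append(row)
    return filled


# Hypothetical school codes and coordinates, for illustration only.
lookup = {"SCHOOL_A": (-48.0, -15.8)}
rows = [
    {"code_school": "SCHOOL_A", "longitude": None, "latitude": None},
    {"code_school": "SCHOOL_B", "longitude": -47.9, "latitude": -15.7},
    {"code_school": None, "longitude": None, "latitude": None},
]
imputed = impute_coordinates(rows, lookup)
# Rows that are still missing coordinates after the lookup are dropped.
kept = [r for r in imputed if r["longitude"] is not None and r["latitude"] is not None]
print(len(imputed), len(kept))
```

Rows without a school identifier cannot be matched to any external source, so they remain missing and are removed by the final filter.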

Finally, the cleaned dataset was converted into a spatial object using a standard geographic coordinate reference system (WGS 84). This transformation enables future spatial analyses while preserving the original coordinate variables for transparency and compatibility with non-spatial workflows.

Overall, the resulting dataset provides a consistent and well-documented foundation for exploratory analysis of school absenteeism at high temporal resolution. The cleaning steps undertaken here explicitly address identifiable data-quality issues while maintaining a cautious and modular approach. Subsequent sections build on this baseline to examine temporal coverage and descriptive patterns, with these preprocessing decisions kept explicitly in view.

4.2 Data QA checks

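The checks below formalize basic invariants and summarize the impact of cleaning decisions. Each item reports the count and share of affected rows; shares are rounded to two decimal places, so very small counts display as 0%. The 542 rows dropped by the coordinate filter equal in number the 542 records with missing school identifiers and metadata in the structure summary above, which suggests they are the same records. Removing them leaves 498,677 − 542 = 498,135 rows after cleaning.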

Data QA Checks and Cleaning Impact
Check	Affected Rows	Percent	Action
Rows in raw dataset	498,677	100%	Baseline
Rows after cleaning	498,135	99.89%	Derived after cleaning
Rows with missing coordinates (pre-imputation)	8	0%	Imputed deterministically
Rows imputed (coordinates)	7	0%	Imputed deterministically
Rows dropped after coordinate filter	542	0.11%	Dropped if still missing
Attendance identity mismatch	0	0%	Flagged for review
Non-negative attendance violations	0	0%	Flagged for review
Absenteeism rate outside [0, 1]	0	0%	Flagged for review
Duplicate key rows	4,666	0.94%	Flagged for review
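
The invariants in the table can be expressed as simple row-level tests. The sketch below (Python, illustrative) rests on three assumptions that the report does not state explicitly: the attendance identity is total_faltas + total_presencas + total_justificadas = count, the panel key is (code_school, data), and the duplicate count includes every row that shares a key with another row.

```python
from collections import Counter


def qa_checks(rows):
    """Count rows violating each invariant; rows are dicts of school-day records."""
    identity_mismatch = 0
    negative_counts = 0
    rate_out_of_range = 0
    for r in rows:
        parts = (r["total_faltas"], r["total_presencas"], r["total_justificadas"])
        total = sum(parts)
        if total != r["count"]:  # assumed attendance identity
            identity_mismatch += 1
        if any(value < 0 for value in parts):
            negative_counts += 1
        if total > 0 and not 0 <= r["total_faltas"] / total <= 1:
            rate_out_of_range += 1
    # Assumed panel key: one row per school and date.
    key_counts = Counter((r["code_school"], r["data"]) for r in rows)
    duplicate_key_rows = sum(n for n in key_counts.values() if n > 1)
    return {
        "identity_mismatch": identity_mismatch,
        "negative_counts": negative_counts,
        "rate_out_of_range": rate_out_of_range,
        "duplicate_key_rows": duplicate_key_rows,
    }


def make_row(code, day, faltas, presencas, justificadas, count):
    return {"code_school": code, "data": day, "total_faltas": faltas,
            "total_presencas": presencas, "total_justificadas": justificadas,
            "count": count}


# The first row reuses counts from the sample table; the others are synthetic.
sample = [
    make_row("53000234", "2018-04-03", 1335, 770, 5, 2110),
    make_row("53000234", "2018-04-03", 1335, 770, 5, 2110),  # duplicated key
    make_row("A", "2018-04-03", 10, 80, 5, 100),             # identity mismatch
    make_row("B", "2018-04-03", -1, 10, 0, 9),               # negative count
]
report = qa_checks(sample)
print(report)
```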

5 Temporal and institutional coverage

This section describes the effective temporal and institutional coverage of the cleaned dataset. The objective is to document the observation window, the number of schools represented, and the continuity of observations over time, without introducing substantive interpretation of outcomes.

Understanding these dimensions is essential to contextualize subsequent descriptive results and to assess whether the data structure supports different empirical strategies in later stages.

Summary of Dataset Coverage
Start Date	End Date	Unique Days	Unique Schools	Total Observations
2016-01-04	2019-12-20	1,011	631	498,135

The dataset encompasses a multi-year period from January 4, 2016, to December 20, 2019, with daily observations at the school level. The total number of observations amounts to 498,135, reflecting the combination of 631 distinct schools and a total of 1,011 unique days of reporting. This temporal and institutional coverage provides a robust foundation for analyzing patterns of absenteeism across a significant portion of the educational landscape in the Federal District during this period.

Summary of Observed Days per School-Year
Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
1	204	211	212.98	221	251

Distribution of Days Observed per School-Year
Percentile	Days Observed
5%	200
25%	204
50%	211
75%	221
95%	235

Most school-years exhibit substantial temporal coverage: the median is 211 observed days, and roughly 95% of school-years have 200 or more. A smaller subset presents more limited coverage, which will require attention in subsequent analyses, particularly when constructing longitudinal comparisons or defining analytical samples.

The panel is therefore largely well covered but not perfectly balanced. While most institutions report attendance information for a substantial portion of the academic year, others enter or exit the dataset at different points in time. This unbalanced structure is typical of administrative education data and reflects institutional openings, closures, reporting gaps, or changes in operational status.

A small number of school-years exhibit markedly lower numbers of observed days. Closer inspection suggests that these cases are not random reporting failures: 9 of the 16 low-coverage school-years listed below belong to institutions flagged as providing specialized or complementary educational services, such as special education (Atendimento Educacional Especializado) and targeted support programs. These units typically operate under distinct organizational arrangements and do not follow the standard daily attendance structure of regular schools.

As a result, the presence of schools with limited temporal coverage reflects institutional heterogeneity rather than data-quality deficiencies. This distinction is important for subsequent analyses, as it motivates differentiated treatment of school types and careful definition of analytical samples when constructing longitudinal comparisons or aggregate indicators.

Low-Coverage School-Years (First 20 Rows)
year	code_school	education_level	education_level_others	n_days_obs	first_date	last_date
2018	53000234	Educação de Jovens Adultos	Atendimento Educacional Especializado	161	2018-02-16	2018-12-21
2016	53001451	Ensino Fundamental	Atendimento Educacional Especializado	171	2016-02-29	2016-12-23
2016	53001486	Ensino Fundamental	Atendimento Educacional Especializado	163	2016-02-29	2016-12-23
2016	53001516	Ensino Fundamental	Atendimento Educacional Especializado	161	2016-02-29	2016-12-23
2016	53001540	Ensino Fundamental	NA	180	2016-02-29	2016-12-28
2016	53001630	Ensino Fundamental	Atendimento Educacional Especializado	163	2016-02-29	2016-12-21
2017	53001737	Ensino Fundamental	Atendimento Educacional Especializado	8	2017-02-10	2017-02-21
2016	53002270	Educação Infantil	NA	2	2016-02-29	2016-03-01
2018	53007743	Ensino Fundamental	NA	1	2018-02-15	2018-02-15
2016	53008839	Ensino Fundamental	NA	8	2016-02-29	2016-03-09
2016	53008847	Ensino Fundamental	Atendimento Educacional Especializado	171	2016-02-29	2016-12-23
2016	53012798	Ensino Fundamental	NA	169	2016-02-29	2016-12-22
2016	53013557	Ensino Fundamental	Atendimento Educacional Especializado	3	2016-02-29	2016-03-02
2017	53017277	Educação Infantil	NA	15	2017-05-02	2017-07-03
2017	53017323	Educação Infantil	NA	168	2017-04-03	2017-12-21
2017	53068165	Ensino Fundamental	Atendimento Educacional Especializado, Atividade Complementar	1	2017-02-10	2017-02-10

Distribution of Number of Active Schools per Day
Percentile	Active Schools
5%	53.5
25%	516.0
50%	562.0
75%	602.0
95%	627.0

Figure: Coverage diagnostics (active schools by day and observed days per school-year)

The number of active schools varies over time, indicating changes in institutional participation or reporting. This variation further reinforces the unbalanced nature of the panel and motivates careful consideration of sample definitions in later stages of the analysis.
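
The daily count of active schools is a count of distinct school codes per date. A minimal Python sketch follows, assuming that a school is "active" on a day whenever it has at least one record for that date and that dates are ISO strings (so sorting them as text gives chronological order); the function name and the sample codes are hypothetical.

```python
from collections import defaultdict


def active_schools_per_day(records):
    """records: iterable of (code_school, iso_date); returns {iso_date: n_schools}."""
    schools = defaultdict(set)
    for code, day in records:
        schools[day].add(code)  # a set ignores duplicate school-day rows
    return {day: len(codes) for day, codes in sorted(schools.items())}


# Synthetic records with hypothetical school codes.
records = [
    ("A", "2016-02-29"), ("B", "2016-02-29"), ("C", "2016-02-29"),
    ("A", "2016-03-01"), ("B", "2016-03-01"),
    ("A", "2016-03-01"),  # duplicate row, counted once
]
active = active_schools_per_day(records)
print(active)
```

Tracking this series alongside aggregate absenteeism helps separate genuine changes in attendance from changes in which schools are reporting.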

Overall, the dataset provides broad temporal coverage and includes a large number of schools observed at high frequency. While the panel is unbalanced, its structure is consistent with the characteristics of large-scale administrative education data. These features are taken into account in the descriptive analyses that follow and will inform modeling choices in subsequent phases of the research.

6 Initial descriptive analysis

This section presents an initial descriptive characterization of school absenteeism using the cleaned dataset. The objective is to establish baseline levels, dispersion, and heterogeneity across schools, without introducing causal interpretation or model-based inference.

Given the institutional heterogeneity documented in the previous section, descriptive results are presented for the full sample and, when relevant, with explicit reference to differences between regular schools and institutions providing specialized or complementary educational services.

Absenteeism is measured as the ratio between total absences and total recorded attendance on each school-day. This measure provides a scale-free indicator that facilitates comparison across institutions of different sizes.

Analytical Sample Comparison (Regular Panel vs Specialized Coverage)
Sample Group	Schools	Median Days per Year	Mean Absenteeism	P25	P75
Regular panel	631	211	10.7%	10.7%	10.7%
Specialized/limited coverage	16	161	15.1%	15.1%	15.1%
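
Group-level quartiles are informative only when they are computed over the school-day rates within each group; computed over a single pre-aggregated value, they collapse onto the mean. The sketch below (Python, illustrative, with synthetic rates and hypothetical names) shows one way to obtain the mean, P25 and P75 per sample group from row-level rates.

```python
from collections import defaultdict
from statistics import mean, quantiles


def summarize_by_group(records):
    """records: iterable of (group, rate) pairs at the school-day level."""
    rates = defaultdict(list)
    for group, rate in records:
        rates[group].append(rate)
    summary = {}
    for group, values in rates.items():
        if len(values) >= 2:
            # "inclusive" interpolates linearly between the observed order statistics.
            p25, _, p75 = quantiles(values, n=4, method="inclusive")
        else:
            p25 = p75 = None  # quartiles are not meaningful for a single value
        summary[group] = {"n": len(values), "mean": mean(values), "p25": p25, "p75": p75}
    return summary


# Synthetic school-day rates, for illustration only.
records = [
    ("regular", 0.05), ("regular", 0.10), ("regular", 0.15), ("regular", 0.30),
    ("specialized", 0.40),
]
summary = summarize_by_group(records)
print(summary["regular"])
```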

Descriptive Statistics of Daily Absenteeism Rates
Mean	Median	Std. Dev.	25th Percentile	75th Percentile
0.108	0.0946	0.0744	0.0617	0.1392

At the aggregate level, absenteeism rates exhibit substantial dispersion (standard deviation 0.0744 around a mean of 0.108), indicating marked heterogeneity in attendance behavior across schools and over time. The median (0.0946) lies below the mean and provides a more robust benchmark, as the distribution is skewed by extreme observations.

The distribution of absenteeism rates is right-skewed, with a concentration of observations at relatively low levels and a long upper tail. This pattern is consistent with episodic spikes in absences, potentially driven by institutional, temporal, or contextual factors.

Absenteeism Rates by Education Level
education_level	n	mean	median	p25	p75
Educação Infantil	47730	0.1536	0.1402	0.1085	0.1805
Educação Infantil, Ensino Fundamental	140149	0.0956	0.0865	0.0604	0.1176
Educação Infantil, Ensino Fundamental, Educação de Jovens Adultos	7031	0.1295	0.1111	0.0717	0.1710
Educação Infantil, Ensino Fundamental, Ensino Médio	1729	0.0555	0.0532	0.0295	0.0751
Educação Infantil, Ensino Fundamental, Ensino Médio, Educação de Jovens Adultos	4132	0.1257	0.1182	0.0800	0.1641
Educação de Jovens Adultos	815	0.4680	0.4804	0.4038	0.5382
Ensino Fundamental	177589	0.0867	0.0764	0.0509	0.1085
Ensino Fundamental, Educação de Jovens Adultos	43992	0.1289	0.1226	0.0793	0.1694
Ensino Fundamental, Ensino Médio	14133	0.1226	0.1144	0.0754	0.1590
Ensino Fundamental, Ensino Médio, Educação Profissional	894	0.0904	0.0891	0.0579	0.1185
Ensino Fundamental, Ensino Médio, Educação de Jovens Adultos	18845	0.1355	0.1291	0.0877	0.1788
Ensino Médio	23538	0.1335	0.1343	0.0820	0.1781
Ensino Médio, Educação Profissional	1299	0.0757	0.0588	0.0218	0.1010
Ensino Médio, Educação de Jovens Adultos	16259	0.1528	0.1463	0.0952	0.1996

Absenteeism Rates by Education Level (Other)
education_level_others	n	mean	median	p25	p75
Atendimento Educacional Especializado	211721	0.1057	0.0912	0.0598	0.1358
Atendimento Educacional Especializado, Atividade Complementar	91572	0.1051	0.0945	0.0623	0.1359
Atividade Complementar	44076	0.1104	0.0957	0.0615	0.1422
NA	150766	0.1122	0.0992	0.0645	0.1447

Summary Statistics of School-Average Absenteeism Rates
Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
0.0334	0.0811	0.0993	0.1088	0.1295	0.5104

When aggregating absenteeism rates at the school level, substantial heterogeneity becomes evident. While many schools exhibit relatively low average absenteeism, a non-negligible subset displays persistently higher levels, reinforcing the importance of accounting for institutional differences in subsequent analyses. The dispersion observed across schools highlights that absenteeism is not uniformly distributed within the education system. This heterogeneity motivates stratified analyses and the exploration of covariates that may explain persistent differences across institutions.

Figure: Distribution of Daily Absenteeism Rates by Education Level

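The figure above presents the distribution of daily absenteeism rates by education level. Clear differences across modalities are observed. Among single-level schools, primary education (Ensino Fundamental) exhibits the lowest and most concentrated rates (median 0.0764, interquartile range 0.0509 to 0.1085). Early childhood education (Educação Infantil, median 0.1402) and secondary education (Ensino Médio, median 0.1343) sit at higher levels, with secondary education displaying the greatest dispersion of the three (interquartile range 0.0820 to 0.1781).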

Youth and adult education (Educação de Jovens Adultos, EJA) shows systematically higher absenteeism levels (mean 0.468, median 0.480) and a distribution shifted toward larger values, although this category rests on only 815 school-days. This pattern is consistent with the specific attendance dynamics of this modality, which typically involves older students, flexible schedules, and different engagement constraints.

Combined education levels and specialized modalities present wider and less regular distributions, reflecting their heterogeneous institutional arrangements and limited temporal coverage. These differences reinforce the importance of distinguishing education levels when analyzing absenteeism patterns and motivate stratified or modality-specific analyses in subsequent stages.

Overall, the descriptive evidence presented in this section establishes key empirical regularities of school absenteeism at high temporal resolution. The observed levels and dispersion provide a factual baseline for the analysis of temporal patterns and institutional correlates explored in the next sections.

7 Temporal Patterns

To characterize the temporal dynamics of absenteeism, we first aggregate the data at different time scales: daily, monthly, and by day of the week.

We visualize these aggregated patterns to identify trends, seasonality, and weekly cycles.

Key temporal findings:
- Absenteeism shows clear seasonality across months, with recurring peaks and troughs.
- Weekly cycles are evident, with systematic differences by day of week.
- The smoothed daily series indicates gradual trend variation rather than abrupt breaks.

Cautions:
- Changes in the number of active schools over time can mechanically shift aggregate rates.
- Calendar effects (school holidays and administrative reporting gaps) may influence daily dynamics.
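
The aggregation by month and by day of week can be sketched as follows (Python, illustrative). The sketch pools counts within each period, dividing summed absences by summed recorded student-days, which weights schools by size. Whether the report pools counts or averages school-level rates is not stated, so this choice is an assumption, as are the function and variable names.

```python
from collections import defaultdict
from datetime import date


def pooled_rate_by(rows, key):
    """Pooled absenteeism rate per period; key maps a date to a period label."""
    absences = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        period = key(date.fromisoformat(r["data"]))
        absences[period] += r["total_faltas"]
        totals[period] += r["total_faltas"] + r["total_presencas"] + r["total_justificadas"]
    return {p: absences[p] / totals[p] for p in sorted(totals) if totals[p] > 0}


# Three of the sample rows shown in Section 4.1.
rows = [
    {"data": "2018-04-03", "total_faltas": 1335, "total_presencas": 770, "total_justificadas": 5},
    {"data": "2018-04-04", "total_faltas": 880, "total_presencas": 585, "total_justificadas": 15},
    {"data": "2018-04-10", "total_faltas": 1370, "total_presencas": 735, "total_justificadas": 15},
]
by_weekday = pooled_rate_by(rows, lambda d: d.weekday())        # Monday = 0
by_month = pooled_rate_by(rows, lambda d: (d.year, d.month))
print(by_weekday)
print(by_month)
```

Because the set of active schools changes over time, pooled rates for different periods may describe different mixes of schools; restricting the input to a fixed set of schools is one way to check how much of a pattern is compositional.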

8 Spatial Patterns

The map below displays the spatial distribution of average absenteeism rates across schools. No spatial statistics are computed at this stage; the goal is to identify potential spatial clusters or gradients visually.

The point and grid maps are visually suggestive of spatial clustering. Formal spatial autocorrelation tests are deferred to later stages of the analysis.
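
A grid summary of this kind can be approximated by binning school coordinates into square cells and averaging school-level rates within each cell. The report does not describe how its grid was built, so the sketch below (Python) is an assumption throughout: the cell size in degrees, the function name, and the points themselves are illustrative.

```python
import math
from collections import defaultdict


def grid_average(points, cell_deg=0.05):
    """Average a value over square cells; points are (lon, lat, value) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lon, lat, value in points:
        # Integer cell index along each axis; floor handles negative coordinates.
        cell = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical school locations and average rates, for illustration only.
points = [
    (-47.88, -15.81, 0.10),
    (-47.87, -15.82, 0.20),
    (-48.12, -15.63, 0.06),
]
cells = grid_average(points)
print(cells)
```

Cell averages depend on the chosen cell size and grid origin, so any apparent clustering should be checked under more than one grid before formal tests are run.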

9 Reproducibility notes

All results were generated with a fixed random seed (1234). Package versions and session details are recorded below.

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.utf8    
## 
## time zone: America/Toronto
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] geobr_1.9.1      scales_1.4.0     kableExtra_1.4.0 terra_1.8-93    
##  [5] maptiles_0.11.0  ggspatial_1.1.10 sf_1.0-24        janitor_2.2.1   
##  [9] skimr_2.2.2      lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0   
## [13] dplyr_1.2.0      purrr_1.2.1      readr_2.1.6      tidyr_1.3.2     
## [17] tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    viridisLite_0.4.3   farver_2.1.2       
##  [4] S7_0.2.1            fastmap_1.2.0       digest_0.6.39      
##  [7] timechange_0.4.0    lifecycle_1.0.5     magrittr_2.0.4     
## [10] compiler_4.5.2      rlang_1.1.7         sass_0.4.10        
## [13] tools_4.5.2         yaml_2.3.12         data.table_1.18.2.1
## [16] knitr_1.51          labeling_0.4.3      bit_4.6.0          
## [19] classInt_0.4-11     curl_7.0.0          xml2_1.5.2         
## [22] repr_1.1.7          RColorBrewer_1.1-3  abind_1.4-8        
## [25] KernSmooth_2.23-26  withr_3.0.2         grid_4.5.2         
## [28] e1071_1.7-17        cli_3.6.5           rmarkdown_2.30     
## [31] crayon_1.5.3        generics_0.1.4      otel_0.2.0         
## [34] rstudioapi_0.18.0   tzdb_0.5.0          DBI_1.2.3          
## [37] cachem_1.1.0        proxy_0.4-29        splines_4.5.2      
## [40] parallel_4.5.2      s2_1.1.9            base64enc_0.1-6    
## [43] vctrs_0.7.1         Matrix_1.7-4        jsonlite_2.0.0     
## [46] hms_1.1.4           bit64_4.6.0-1       systemfonts_1.3.1  
## [49] jquerylib_0.1.4     units_1.0-0         glue_1.8.0         
## [52] codetools_0.2-20    stringi_1.8.7       gtable_0.3.6       
## [55] pillar_1.11.1       htmltools_0.5.9     R6_2.6.1           
## [58] wk_0.9.5            textshaping_1.0.4   vroom_1.7.0        
## [61] evaluate_1.0.5      lattice_0.22-7      snakecase_0.11.1   
## [64] bslib_0.10.0        class_7.3-23        Rcpp_1.1.1         
## [67] svglite_2.2.2       nlme_3.1-168        mgcv_1.9-3         
## [70] xfun_0.56           fs_1.6.6            pkgconfig_2.0.3

10 Key decisions and next steps

10.1 Key decisions

Use a regular-panel definition of ≥ 200 observed days per school-year for longitudinal analyses, ensuring comparability and reducing bias driven by intermittent reporting.
Treat specialized or limited-coverage institutions (e.g., special education services, complementary programs, atypical calendars) as a separate analytical sample, to avoid conflating institutional heterogeneity with data-quality issues.
In model-based work, include explicit controls for calendar effects (e.g., day-of-week, month/term structure, holidays where available) and for school activity patterns (e.g., number of active days, enrollment/attendance volume), to reduce confounding from operational variation.
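
The regular-panel rule can be sketched as a count of distinct observed days per school-year followed by a threshold. The Python below is illustrative: the function names, the label "limited", and the synthetic records are assumptions, while the 200-day threshold comes from the decision above.

```python
from collections import defaultdict
from datetime import date, timedelta


def days_per_school_year(records):
    """records: iterable of (code_school, date); returns {(code, year): n_days}."""
    days = defaultdict(set)
    for code, day in records:
        days[(code, day.year)].add(day)  # a set ignores duplicate school-day rows
    return {key: len(value) for key, value in days.items()}


def classify_school_years(day_counts, min_days=200):
    """Label each school-year as 'regular' (>= min_days) or 'limited'."""
    return {key: "regular" if n >= min_days else "limited"
            for key, n in day_counts.items()}


# Synthetic records with hypothetical school codes.
start = date(2016, 2, 29)
records = [("A", start + timedelta(days=i)) for i in range(211)]
records += [("B", start + timedelta(days=i)) for i in range(8)]
counts = days_per_school_year(records)
labels = classify_school_years(counts)
print(counts, labels)
```

Applying the rule per school-year rather than per school lets a school count as regular in some years and limited in others.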

10.2 Next analytical steps (within the current dataset)

Finalize the temporal exploration (seasonality, day-of-week cycles, and within-year dynamics) stratified by education level and school type (regular vs specialized).
Define the primary outcome(s) to be modeled in later stages (e.g., daily absenteeism rate, log-odds transformation, or count models with offsets) and pre-specify inclusion rules (regular panel vs full sample).
Produce a concise set of “model-readiness” diagnostics: distributional checks, influence of extreme observations, and sensitivity to alternative treatments of justified absences (e.g., excluding them from the denominator or, as in the effective absence measure of Section 3.1, adding them to the numerator, if conceptually required).
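
One candidate outcome mentioned above, the log-odds transformation, is undefined for school-days with zero absences or zero presences. A common workaround is the empirical logit, which adds a small constant to both counts. The sketch below (Python) shows that variant as one possible choice, not as the specification adopted in this research; the 0.5 correction and the function name are assumptions.

```python
import math


def empirical_logit(absences, total, correction=0.5):
    """Log-odds of absence with a continuity correction for 0% and 100% days."""
    return math.log((absences + correction) / (total - absences + correction))


# Counts from the first sample row (1335 absences out of 2110 student-days).
print(empirical_logit(1335, 2110))
# A day with no absences still yields a finite value.
print(empirical_logit(0, 10))
```

Count models with an offset for total student-days avoid the correction altogether, at the cost of a distributional assumption about the counts.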

10.3 Data enrichment opportunities for future research

Beyond the exploratory analysis, the dataset can be significantly enriched through linkage with external sources to support more substantive research questions. Potential extensions include:

Educational outcomes (IDEB / proficiency metrics): link school identifiers to IDEB (or other standardized performance measures) to study whether persistent absenteeism patterns correlate with learning outcomes and school performance profiles. This enables descriptive benchmarking and, later, quasi-experimental designs if policy timing or shocks can be identified.
Local epidemiological cycles (DATASUS): merge school-day observations with municipality/region-level indicators of respiratory infections or other relevant morbidity proxies (e.g., influenza-like illness, hospital admissions). This supports tests of whether absenteeism spikes track epidemiological dynamics, with appropriate lags and seasonal controls.
Air pollution and exposure proxies: integrate pollution measures (monitoring stations, modeled surfaces, or satellite-derived products when available). In the absence of direct pollution series, construct exposure proxies using roadway density and traffic-related indicators (e.g., kernel density of major roads) combined with meteorological controls, acknowledging measurement error explicitly.
Urban greenness (vegetation indices and land cover): attach school-level greenness measures (e.g., NDVI buffers, land-cover shares, canopy cover when available) to assess whether environmental amenities correlate with attendance patterns. This is particularly relevant for spatial heterogeneity and for studies of environmental inequality.
Thermal comfort and heat/dryness extremes (climate exposure): integrate meteorological data to capture heat stress and thermal discomfort as potential drivers of absenteeism. This can be operationalized using daily temperature (mean/max), humidity, and derived indices such as heat index / apparent temperature or UTCI (when inputs are available). In Brasília/DF, dryness and large diurnal ranges are also relevant; therefore, including relative humidity, precipitation, and drought proxies can help capture discomfort associated with low humidity and sustained dry spells. These linkages enable analyses of short-run effects (same-day and lag structures), threshold behavior (extreme heat days), and interactions with school type and urban exposure (e.g., greenness and road density).

These enrichments would allow the transition from descriptive exploration to hypothesis-driven analyses, while preserving transparency about data linkage assumptions, spatial/temporal alignment, and potential confounding.

Initial Exploratory Analysis of School Absenteeism in the Federal District

Preliminary data diagnostics and descriptive evidence

Thiago Gardin

February 2026