1 Executive summary

2 Context and objective

School absenteeism is a central concern in educational research and public policy, as it is closely associated with learning outcomes, student engagement, and longer-term socioeconomic trajectories. Understanding its temporal dynamics and institutional heterogeneity is particularly relevant in large urban education systems, where schools operate under diverse contextual and organizational conditions.

This report presents an initial exploratory analysis of a large administrative dataset covering daily attendance records for public schools in the Federal District of Brazil. The purpose of this document is to assess the structure, coverage, and internal consistency of the data, as well as to provide preliminary descriptive evidence on patterns of absenteeism across schools and over time.

The analysis conducted here is intentionally descriptive and diagnostic in nature. No causal claims are made, and no multivariate modeling is attempted at this stage. Instead, the focus is on identifying key data features, potential quality issues, and broad empirical regularities that will inform subsequent data-cleaning procedures and the design of future analytical models.

This exploratory step is intended to support methodological decisions in the next phases of the research, including variable construction, sample definition, and the choice of appropriate empirical strategies.

3 Data provenance and unit of analysis

The dataset is an administrative extract of daily attendance records for public schools in the Federal District of Brazil. The unit of analysis is the school-day. The table below summarizes coverage.

Data Coverage and Unit of Analysis
Start Date | End Date | Unique Days | Unique Schools | Total Observations
2016-01-01 | 2019-12-31 | 1,461 | 632 | 498,677

3.1 Key definitions

  • Absenteeism rate: total_faltas / (total_presencas + total_faltas + total_justificadas) at the school-day level.

  • For comparability across schools with different reporting practices, justified absences are kept in the denominator, since they remain part of the recorded attendance accounting and define total observed student-days.

  • For economic and epidemiological interpretations, an alternative “effective absence” measure can also be defined by treating justified absences as non-attendance: absent_effective = (total_faltas + total_justificadas) / (total_presencas + total_faltas + total_justificadas). This alternative aligns more closely with interpretations related to learning time loss and illness-related absenteeism; however, it is not explored in this report and is reserved for subsequent model-based and robustness analyses.

  • Sample handling distinguishes regular panel coverage from specialized or limited-coverage institutions, as detailed in later sections.
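The two rate definitions above can be sketched in R as follows. This is a minimal illustration: the column names come from the dataset, while the function names and the choice to return NA when no student-days are recorded are assumptions made here.

```r
# Primary absenteeism rate: justified absences stay in the denominator only.
# Function names are illustrative and not part of the original pipeline.
absenteeism_rate <- function(faltas, presencas, justificadas) {
  total <- presencas + faltas + justificadas
  # Return NA when no student-days were recorded, instead of dividing by zero
  ifelse(total > 0, faltas / total, NA_real_)
}

# Alternative "effective absence": justified absences also count as non-attendance.
effective_absence_rate <- function(faltas, presencas, justificadas) {
  total <- presencas + faltas + justificadas
  ifelse(total > 0, (faltas + justificadas) / total, NA_real_)
}

# Worked example with the first row of the data preview in Section 4.1
primary_example   <- absenteeism_rate(1335, 770, 5)        # 1335 / 2110
effective_example <- effective_absence_rate(1335, 770, 5)  # 1340 / 2110
```

Both functions are vectorized, so they can be applied directly to the attendance columns of the full school-day table.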

4 Data description and diagnostics

The dataset analyzed in this report consists of a large administrative panel with daily attendance information at the school level for public schools in the Federal District of Brazil. Each observation corresponds to a specific school-day combination, allowing the analysis of both temporal variation within schools and cross-sectional heterogeneity across institutions.

In total, the dataset contains 498,677 observations and 23 variables. The variables can be grouped into four broad categories: (i) school identifiers and institutional characteristics, (ii) educational and administrative classifications, (iii) attendance-related counts, and (iv) temporal and spatial information. Most institutional variables are constant within schools, while attendance measures vary daily.

From a structural perspective, the unit of observation and the panel nature of the data are well defined. However, several data-quality issues are immediately apparent and are documented below, as they have direct implications for subsequent data preparation and modeling decisions.

Dataset Structure Summary
Variable Type Missing
...1 numeric 0
code_school numeric 542
name_school character 542
education_level character 542
education_level_others character 151308
admin_category character 542
address character 542
government_level character 542
regulated_education_council character 542
service_restriction character 542
size character 542
urban character 542
location_type character 542
date_update Date 542
dia_data Date 0
cep_escola numeric 2724
count numeric 0
ano_nascimento numeric 542
total_faltas numeric 0
total_presencas numeric 0
total_justificadas numeric 0
data Date 0
geom character 0

The school identifier (code_school) is currently stored as a numeric variable. Given its role as a categorical identifier rather than a quantitative measure, this variable should be converted to character format to avoid precision loss and unintended behavior in joins or grouping operations. Additionally, an index-like variable (...1) appears to be a residual artifact from data import procedures and does not carry analytical relevance.

The dataset includes two date variables (dia_data and data), both stored as Date objects. Their coexistence suggests redundancy or inconsistent naming across data sources. A single reference date variable will need to be selected after verifying their equivalence.

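A more serious issue is observed in the variable labeled ano_nascimento (year of birth), whose values are not valid integer years: the data preview in Section 4.1 shows fractional values such as 1996.365 and 1993.818. This suggests data corruption, misclassification, the unintended inclusion of an anonymized identifier, or some form of aggregation across students (for example, a school-day average of birth years); none of these can be confirmed from the extract itself. Given its current state, this variable is excluded from the initial exploratory analysis until further clarification is obtained.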

Attendance-related variables (total_presencas, total_faltas, and total_justificadas) are well defined as numeric counts and form the analytical core of this study. These variables are conceptually consistent and vary at the school-day level, making them suitable for both descriptive analysis and subsequent modeling of absenteeism dynamics.

An additional count variable (count) appears to represent the total number of enrolled or expected students per school-day, although its precise definition will require confirmation. Together, these variables enable the construction of derived indicators such as absenteeism rates and adjusted attendance measures.

Spatial information is provided through a geometry variable (geom), currently stored as a character string containing coordinate pairs. In its present format, this variable is not directly compatible with standard spatial analysis workflows in R (e.g., sf objects). Spatial processing is therefore deferred to later stages, following appropriate parsing and validation of the coordinate information.

4.1 Data cleaning and preprocessing

Before proceeding to descriptive analyses, a structured data-cleaning and preprocessing step was conducted to establish a coherent and analytically reliable baseline. Given the administrative origin and large scale of the dataset, this process prioritized clarity of identifiers, consistency of attendance measures, consolidation of temporal information, and validation of spatial metadata.

The purpose of this section is not to document all possible future transformations, but to explicitly describe the initial decisions required to make the dataset suitable for systematic exploration and subsequent modeling. All transformations applied at this stage are transparent, limited in scope, and reversible.

First 6 rows of the cleaned dataset (Subset of columns)
code_school name_school education_level education_level_others admin_category address government_level regulated_education_council service_restriction size urban location_type dia_data cep_escola count ano_nascimento total_faltas total_presencas total_justificadas data geom total_attendance longitude latitude
53000234 CEJA ASA SUL - CESAS Educação de Jovens Adultos Atendimento Educacional Especializado Pública QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. Estadual Sim ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO Mais de 1000 matrículas de escolarização Urbana A escola não está em área de localização diferenciada 2018-04-03 70200620 2110 1996.365 1335 770 5 2018-04-03 c(-47.8821157, -15.8094245) 2110 -47.88212 -15.80942
53000234 CEJA ASA SUL - CESAS Educação de Jovens Adultos Atendimento Educacional Especializado Pública QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. Estadual Sim ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO Mais de 1000 matrículas de escolarização Urbana A escola não está em área de localização diferenciada 2018-04-04 70200620 1480 1993.818 880 585 15 2018-04-04 c(-47.8821157, -15.8094245) 1480 -47.88212 -15.80942
53000234 CEJA ASA SUL - CESAS Educação de Jovens Adultos Atendimento Educacional Especializado Pública QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. Estadual Sim ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO Mais de 1000 matrículas de escolarização Urbana A escola não está em área de localização diferenciada 2018-04-06 70200620 1675 1994.179 1130 535 10 2018-04-06 c(-47.8821157, -15.8094245) 1675 -47.88212 -15.80942
53000234 CEJA ASA SUL - CESAS Educação de Jovens Adultos Atendimento Educacional Especializado Pública QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. Estadual Sim ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO Mais de 1000 matrículas de escolarização Urbana A escola não está em área de localização diferenciada 2018-04-10 70200620 2120 1996.387 1370 735 15 2018-04-10 c(-47.8821157, -15.8094245) 2120 -47.88212 -15.80942
53000234 CEJA ASA SUL - CESAS Educação de Jovens Adultos Atendimento Educacional Especializado Pública QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. Estadual Sim ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO Mais de 1000 matrículas de escolarização Urbana A escola não está em área de localização diferenciada 2018-04-11 70200620 1490 1993.732 815 640 35 2018-04-11 c(-47.8821157, -15.8094245) 1490 -47.88212 -15.80942
53000234 CEJA ASA SUL - CESAS Educação de Jovens Adultos Atendimento Educacional Especializado Pública QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. Estadual Sim ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO Mais de 1000 matrículas de escolarização Urbana A escola não está em área de localização diferenciada 2018-04-13 70200620 1715 1994.157 1130 555 30 2018-04-13 c(-47.8821157, -15.8094245) 1715 -47.88212 -15.80942

At this stage, non-informative artifacts generated during data import were removed, and key identifiers were standardized to character format to avoid numerical precision issues. A derived attendance measure was constructed to facilitate internal consistency checks between reported presences, absences, and total counts.
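The cleaning steps described above could look roughly like the sketch below. It runs on a small toy table rather than the real extract. Two things are assumptions based on the data preview rather than documented definitions: that the derived measure (total_attendance) is the sum of the three attendance counts, and that it is compared against count.

```r
# Toy stand-in for the raw extract; all values are illustrative.
raw <- data.frame(
  idx = 1:3,
  code_school = c(53000234, 53000000, NA),
  dia_data = as.Date(c("2018-04-03", "2018-04-04", "2018-04-06")),
  data = as.Date(c("2018-04-03", "2018-04-04", "2018-04-06")),
  total_faltas = c(1335, 880, 1130),
  total_presencas = c(770, 585, 535),
  total_justificadas = c(5, 15, 10),
  count = c(2110, 1480, 1675)
)
names(raw)[1] <- "...1"  # mimic the index column left over from import

# 1. Drop the import artifact
clean <- raw[, setdiff(names(raw), "...1")]

# 2. Store the identifier as text. sprintf("%.0f") is used because
#    as.character() can switch to scientific notation for round values.
clean$code_school <- ifelse(
  is.na(clean$code_school),
  NA_character_,
  sprintf("%.0f", clean$code_school)
)

# 3. Verify that the two date columns agree before keeping only one
dates_agree <- isTRUE(all(clean$dia_data == clean$data))

# 4. Derived total (assumed definition) and its comparison with `count`
clean$total_attendance <- clean$total_presencas +
  clean$total_faltas + clean$total_justificadas
identity_ok <- clean$total_attendance == clean$count
```

Keeping the missing identifiers as NA, rather than as the string "NA", makes it easier to filter those rows explicitly later.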

Spatial information was initially provided as a character string containing coordinate pairs. These values were parsed into numeric longitude and latitude variables, enabling basic validation while postponing full spatial object construction to a later analytical stage.
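One way to parse the coordinate strings is sketched below, assuming they follow the "c(longitude, latitude)" pattern seen in the data preview. The helper name parse_geom is hypothetical, and both NaN and NA entries are mapped to missing values.

```r
# Parse strings such as "c(-47.8821157, -15.8094245)" into numeric columns.
parse_geom <- function(geom) {
  # Strip the leading "c(" and the trailing ")", then split on the comma
  inner <- sub("^c\\((.*)\\)$", "\\1", trimws(geom))
  parts <- strsplit(inner, ",", fixed = TRUE)
  to_num <- function(p, i) {
    if (length(p) < 2) return(NA_real_)
    # "NaN" and "NA" both end up as missing values
    val <- suppressWarnings(as.numeric(trimws(p[i])))
    if (is.nan(val)) NA_real_ else val
  }
  data.frame(
    longitude = vapply(parts, to_num, numeric(1), i = 1),
    latitude  = vapply(parts, to_num, numeric(1), i = 2)
  )
}

coords <- parse_geom(c(
  "c(-47.8821157, -15.8094245)",
  "c(NaN, NaN)",
  "c(NA, NA)"
))
```

The longitude-first order matches the preview, where c(-47.8821157, -15.8094245) corresponds to longitude -47.88212 and latitude -15.80942.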

Schools with Missing Coordinates (Before Imputation)
code_school cep_escola geom
53005341 73030050 c(NaN, NaN)
53005767 73151010 c(NaN, NaN)
53008944 72301700 c(NaN, NaN)
53008979 72669425 c(NaN, NaN)
53011066 72600412 c(NaN, NaN)
53012127 72631115 c(NaN, NaN)
53012666 73307994 c(NaN, NaN)
NA NA c(NA, NA)

A small subset of schools presented missing spatial coordinates after the initial parsing step. For these cases, latitude and longitude values were recovered using externally verified school-level location information.

This imputation procedure was limited to a clearly identified and small number of institutions and does not rely on statistical interpolation. Instead, it represents a deterministic correction of missing metadata based on external sources. All imputed values are explicitly documented and reproducible.
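A deterministic correction of this kind can be expressed as a lookup-table fill, as in the sketch below. The coordinates in the lookup table are placeholders invented for illustration and are not the verified locations; the column names lon_verified, lat_verified and coord_imputed are likewise assumptions.

```r
# Lookup table of externally verified coordinates.
# PLACEHOLDER values for illustration only, not real school locations.
verified <- data.frame(
  code_school  = c("53005341", "53005767"),
  lon_verified = c(-48.00, -47.90),
  lat_verified = c(-15.60, -15.70)
)

schools <- data.frame(
  code_school = c("53000234", "53005341", "53005767"),
  longitude   = c(-47.8821157, NA, NA),
  latitude    = c(-15.8094245, NA, NA)
)

# Fill only rows that are missing a coordinate AND have a verified entry
idx  <- match(schools$code_school, verified$code_school)
fill <- (is.na(schools$longitude) | is.na(schools$latitude)) & !is.na(idx)

schools$coord_imputed   <- fill  # keep an explicit audit flag
schools$longitude[fill] <- verified$lon_verified[idx[fill]]
schools$latitude[fill]  <- verified$lat_verified[idx[fill]]
```

Keeping the coord_imputed flag in the data is one simple way to keep imputed values documented and reversible.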

Finally, the cleaned dataset was converted into a spatial object using a standard geographic coordinate reference system (WGS 84). This transformation enables future spatial analyses while preserving the original coordinate variables for transparency and compatibility with non-spatial workflows.

Overall, the resulting dataset provides a consistent and well-documented foundation for exploratory analysis of school absenteeism at high temporal resolution. The cleaning steps undertaken here explicitly address identifiable data-quality issues while maintaining a cautious and modular approach. Subsequent sections build on this baseline to examine temporal coverage and descriptive patterns, with these preprocessing decisions kept explicitly in view.

4.2 Data QA checks

The checks below formalize basic invariants and summarize the impact of cleaning decisions. Each item reports the count and share of affected rows.

Data QA Checks and Cleaning Impact
Check | Affected Rows | Percent | Action
Rows in raw dataset | 498,677 | 100% | Baseline
Rows after cleaning | 498,135 | 99.89% | Derived after cleaning
Rows with missing coordinates (pre-imputation) | 8 | 0% | Imputed deterministically
Rows imputed (coordinates) | 7 | 0% | Imputed deterministically
Rows dropped after coordinate filter | 542 | 0.11% | Dropped if still missing
Attendance identity mismatch | 0 | 0% | Flagged for review
Non-negative attendance violations | 0 | 0% | Flagged for review
Absenteeism rate outside [0, 1] | 0 | 0% | Flagged for review
Duplicate key rows | 4,666 | 0.94% | Flagged for review
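
The invariant checks in the table could be computed along the lines of the sketch below. Several points are assumptions rather than documented definitions: the attendance identity is taken to be count = presences + absences + justified, the key is taken to be (code_school, data), and duplicates are counted as rows beyond the first occurrence of each key.

```r
qa_checks <- function(df) {
  total <- df$total_presencas + df$total_faltas + df$total_justificadas
  rate  <- df$total_faltas / total
  key   <- df[c("code_school", "data")]
  checks <- c(
    identity_mismatch = sum(total != df$count, na.rm = TRUE),
    negative_counts   = sum(df$total_presencas < 0 | df$total_faltas < 0 |
                              df$total_justificadas < 0, na.rm = TRUE),
    rate_out_of_range = sum(rate < 0 | rate > 1, na.rm = TRUE),
    # duplicated() flags every repeat after the first row of each key
    duplicate_keys    = sum(duplicated(key))
  )
  data.frame(
    check         = names(checks),
    affected_rows = unname(checks),
    percent       = round(100 * unname(checks) / nrow(df), 2)
  )
}

# Toy data: row 3 repeats the key of row 2, row 4 breaks the identity
toy <- data.frame(
  code_school        = c("A", "A", "A", "B"),
  data               = as.Date(c("2018-04-03", "2018-04-04",
                                 "2018-04-04", "2018-04-03")),
  total_faltas       = c(10, 5, 5, 2),
  total_presencas    = c(90, 95, 95, 50),
  total_justificadas = c(0, 0, 0, 1),
  count              = c(100, 100, 100, 60)
)
qa <- qa_checks(toy)
```

Returning the results as a small data frame keeps the checks easy to render as a table and to rerun after each cleaning step.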

5 Temporal and institutional coverage

This section describes the effective temporal and institutional coverage of the cleaned dataset. The objective is to document the observation window, the number of schools represented, and the continuity of observations over time, without introducing substantive interpretation of outcomes.

Understanding these dimensions is essential to contextualize subsequent descriptive results and to assess whether the data structure supports different empirical strategies in later stages.

Summary of Dataset Coverage
Start Date | End Date | Unique Days | Unique Schools | Total Observations
2016-01-04 | 2019-12-20 | 1,011 | 631 | 498,135

The cleaned dataset spans January 4, 2016, to December 20, 2019, with daily observations at the school level. It contains 498,135 observations from 631 distinct schools over 1,011 unique reporting days, which is roughly 78% of the 637,941 school-day combinations that a fully balanced panel would contain. This coverage is sufficient to analyze absenteeism patterns across a large share of public schools in the Federal District during this period.

Summary of Observed Days per School-Year
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
1 | 204 | 211 | 212.98 | 221 | 251

Distribution of Days Observed per School-Year
Percentile | Days Observed
5% | 200
25% | 204
50% | 211
75% | 221
95% | 235

Most school-years exhibit substantial temporal coverage: the median is 211 observed days and the 5th percentile is 200, so most institutions report attendance for a large fraction of the academic year. This suggests high temporal continuity for the majority of schools in the dataset.

The panel is nevertheless not perfectly balanced. While many schools are observed throughout the period, others enter or exit the dataset at different points in time, and a smaller subset presents limited coverage that will require attention when constructing longitudinal comparisons or defining analytical samples. This unbalanced structure is typical of administrative education data and reflects institutional openings, closures, reporting gaps, or changes in operational status.

A small number of school-years exhibit markedly lower numbers of observed days. Closer inspection suggests that these cases are not purely random reporting failures: in the table below, 9 of the 16 low-coverage school-years belong to schools that also report specialized or complementary educational services, such as special education and targeted support programs, while the remaining 7 report none. Units offering such services typically operate under distinct organizational arrangements and may not follow the standard daily attendance structure of regular schools.

As a result, the presence of schools with limited temporal coverage reflects institutional heterogeneity rather than data-quality deficiencies. This distinction is important for subsequent analyses, as it motivates differentiated treatment of school types and careful definition of analytical samples when constructing longitudinal comparisons or aggregate indicators.

Low-Coverage School-Years (First 20 Rows)
year | code_school | education_level | education_level_others | n_days_obs | first_date | last_date
2018 | 53000234 | Educação de Jovens Adultos | Atendimento Educacional Especializado | 161 | 2018-02-16 | 2018-12-21
2016 | 53001451 | Ensino Fundamental | Atendimento Educacional Especializado | 171 | 2016-02-29 | 2016-12-23
2016 | 53001486 | Ensino Fundamental | Atendimento Educacional Especializado | 163 | 2016-02-29 | 2016-12-23
2016 | 53001516 | Ensino Fundamental | Atendimento Educacional Especializado | 161 | 2016-02-29 | 2016-12-23
2016 | 53001540 | Ensino Fundamental | NA | 180 | 2016-02-29 | 2016-12-28
2016 | 53001630 | Ensino Fundamental | Atendimento Educacional Especializado | 163 | 2016-02-29 | 2016-12-21
2017 | 53001737 | Ensino Fundamental | Atendimento Educacional Especializado | 8 | 2017-02-10 | 2017-02-21
2016 | 53002270 | Educação Infantil | NA | 2 | 2016-02-29 | 2016-03-01
2018 | 53007743 | Ensino Fundamental | NA | 1 | 2018-02-15 | 2018-02-15
2016 | 53008839 | Ensino Fundamental | NA | 8 | 2016-02-29 | 2016-03-09
2016 | 53008847 | Ensino Fundamental | Atendimento Educacional Especializado | 171 | 2016-02-29 | 2016-12-23
2016 | 53012798 | Ensino Fundamental | NA | 169 | 2016-02-29 | 2016-12-22
2016 | 53013557 | Ensino Fundamental | Atendimento Educacional Especializado | 3 | 2016-02-29 | 2016-03-02
2017 | 53017277 | Educação Infantil | NA | 15 | 2017-05-02 | 2017-07-03
2017 | 53017323 | Educação Infantil | NA | 168 | 2017-04-03 | 2017-12-21
2017 | 53068165 | Ensino Fundamental | Atendimento Educacional Especializado, Atividade Complementar | 1 | 2017-02-10 | 2017-02-10

Distribution of Number of Active Schools per Day
Percentile | Active Schools
5% | 53.5
25% | 516.0
50% | 562.0
75% | 602.0
95% | 627.0

[Figure: Coverage diagnostics: active schools by day and observed days per school-year]

The number of active schools varies over time, indicating changes in institutional participation or reporting. This variation further reinforces the unbalanced nature of the panel and motivates careful consideration of sample definitions in later stages of the analysis.

Overall, the dataset provides broad temporal coverage and includes a large number of schools observed at high frequency. While the panel is unbalanced, its structure is consistent with the characteristics of large-scale administrative education data. These features are taken into account in the descriptive analyses that follow and will inform modeling choices in subsequent phases of the research.
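
The two coverage diagnostics used in this section (observed days per school-year and active schools per day) can be derived from the school-day keys alone. The sketch below uses a toy table and base R; deduplicating keys before counting is an assumption made here so that repeated school-day rows are not counted twice.

```r
# Toy school-day keys; the last row repeats an earlier key on purpose
obs <- data.frame(
  code_school = c("A", "A", "A", "B", "B", "A"),
  data = as.Date(c("2016-03-01", "2016-03-02", "2017-03-01",
                   "2016-03-01", "2017-03-01", "2016-03-02"))
)

keys      <- unique(obs)  # one row per distinct school-day
keys$year <- as.integer(format(keys$data, "%Y"))
keys$one  <- 1L

# Observed days per school-year
days_per_school_year <- aggregate(one ~ code_school + year,
                                  data = keys, FUN = sum)
names(days_per_school_year)[3] <- "n_days_obs"

# Number of schools reporting on each day
active_schools_per_day <- aggregate(one ~ data, data = keys, FUN = sum)
names(active_schools_per_day)[2] <- "n_active"

# Percentiles of the kind reported in the coverage tables
coverage_pct <- quantile(days_per_school_year$n_days_obs,
                         probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
```

On the full dataset, the same two summaries would feed the percentile tables and the coverage figure in this section.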

6 Initial descriptive analysis

This section presents an initial descriptive characterization of school absenteeism using the cleaned dataset. The objective is to establish baseline levels, dispersion, and heterogeneity across schools, without introducing causal interpretation or model-based inference.

Given the institutional heterogeneity documented in the previous section, descriptive results are presented for the full sample and, when relevant, with explicit reference to differences between regular schools and institutions providing specialized or complementary educational services.

Absenteeism is measured as the ratio of total absences (total_faltas) to total recorded student-days (total_presencas + total_faltas + total_justificadas) on each school-day, as defined in Section 3.1. This measure provides a scale-free indicator that facilitates comparison across institutions of different sizes.

Analytical Sample Comparison (Regular Panel vs Specialized Coverage)
Sample Group | Schools | Median Days per Year | Mean Absenteeism | P25 | P75
Regular panel | 631 | 211 | 10.7% | 10.7% | 10.7%
Specialized/limited coverage | 16 | 161 | 15.1% | 15.1% | 15.1%

Note that the P25 and P75 columns coincide with the mean in both groups, so within-group dispersion cannot be read from this table as reported.

Descriptive Statistics of Daily Absenteeism Rates
Mean | Median | Std. Dev. | 25th Percentile | 75th Percentile
0.108 | 0.0946 | 0.0744 | 0.0617 | 0.1392

At the aggregate level, absenteeism rates exhibit substantial dispersion (standard deviation of 0.074 around a mean of 0.108), indicating marked heterogeneity in attendance behavior across schools and over time. The median (0.095) is a more robust benchmark than the mean, as the distribution is skewed by extreme observations.

The distribution of absenteeism rates is right-skewed, with a concentration of observations at relatively low levels and a long upper tail. This pattern is consistent with episodic spikes in absences, potentially driven by institutional, temporal, or contextual factors.
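
Group-level summaries of a skewed rate are informative only when the quartiles are computed over the school-day observations within each group, not over a value that has already been averaged. The sketch below shows this on simulated rates; the beta-distributed toy data, the group labels and the helper name summarise_rates are assumptions for illustration.

```r
set.seed(1234)

# Simulated right-skewed school-day rates for two sample groups
toy <- data.frame(
  sample_group = rep(c("regular", "specialized"), times = c(200, 50)),
  rate         = c(rbeta(200, 2, 16), rbeta(50, 3, 15))
)

summarise_rates <- function(rate) {
  c(
    mean   = mean(rate, na.rm = TRUE),
    median = median(rate, na.rm = TRUE),
    sd     = sd(rate, na.rm = TRUE),
    p25    = unname(quantile(rate, 0.25, na.rm = TRUE)),
    p75    = unname(quantile(rate, 0.75, na.rm = TRUE))
  )
}

# One row of statistics per group, computed from the raw school-day rates
by_group <- do.call(
  rbind,
  lapply(split(toy$rate, toy$sample_group), summarise_rates)
)
```

Because the quantiles are taken before any averaging, p25, median and p75 differ within each group and describe its dispersion.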

Absenteeism Rates by Education Level
education_level | n | mean | median | p25 | p75
Educação Infantil | 47730 | 0.1536 | 0.1402 | 0.1085 | 0.1805
Educação Infantil, Ensino Fundamental | 140149 | 0.0956 | 0.0865 | 0.0604 | 0.1176
Educação Infantil, Ensino Fundamental, Educação de Jovens Adultos | 7031 | 0.1295 | 0.1111 | 0.0717 | 0.1710
Educação Infantil, Ensino Fundamental, Ensino Médio | 1729 | 0.0555 | 0.0532 | 0.0295 | 0.0751
Educação Infantil, Ensino Fundamental, Ensino Médio, Educação de Jovens Adultos | 4132 | 0.1257 | 0.1182 | 0.0800 | 0.1641
Educação de Jovens Adultos | 815 | 0.4680 | 0.4804 | 0.4038 | 0.5382
Ensino Fundamental | 177589 | 0.0867 | 0.0764 | 0.0509 | 0.1085
Ensino Fundamental, Educação de Jovens Adultos | 43992 | 0.1289 | 0.1226 | 0.0793 | 0.1694
Ensino Fundamental, Ensino Médio | 14133 | 0.1226 | 0.1144 | 0.0754 | 0.1590
Ensino Fundamental, Ensino Médio, Educação Profissional | 894 | 0.0904 | 0.0891 | 0.0579 | 0.1185
Ensino Fundamental, Ensino Médio, Educação de Jovens Adultos | 18845 | 0.1355 | 0.1291 | 0.0877 | 0.1788
Ensino Médio | 23538 | 0.1335 | 0.1343 | 0.0820 | 0.1781
Ensino Médio, Educação Profissional | 1299 | 0.0757 | 0.0588 | 0.0218 | 0.1010
Ensino Médio, Educação de Jovens Adultos | 16259 | 0.1528 | 0.1463 | 0.0952 | 0.1996

Absenteeism Rates by Education Level (Other)
education_level_others | n | mean | median | p25 | p75
Atendimento Educacional Especializado | 211721 | 0.1057 | 0.0912 | 0.0598 | 0.1358
Atendimento Educacional Especializado, Atividade Complementar | 91572 | 0.1051 | 0.0945 | 0.0623 | 0.1359
Atividade Complementar | 44076 | 0.1104 | 0.0957 | 0.0615 | 0.1422
NA | 150766 | 0.1122 | 0.0992 | 0.0645 | 0.1447

Summary Statistics of School-Average Absenteeism Rates
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
0.0334 | 0.0811 | 0.0993 | 0.1088 | 0.1295 | 0.5104

When absenteeism rates are aggregated at the school level, substantial heterogeneity becomes evident: school averages range from 0.033 to 0.510, with an interquartile range of 0.081 to 0.130. While many schools exhibit relatively low average absenteeism, a non-negligible subset displays persistently higher levels, so absenteeism is not uniformly distributed within the education system. This heterogeneity motivates stratified analyses and the exploration of covariates that may explain persistent differences across institutions.
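
A school-level average can be built in two ways that generally give different numbers: the mean of daily rates (each day weighted equally) or the pooled ratio of total absences to total student-days (each student-day weighted equally). The report does not state which variant underlies the table, so the sketch below simply contrasts the two on toy data.

```r
school_days <- data.frame(
  code_school      = c("A", "A", "B", "B"),
  total_faltas     = c(10, 30, 5, 5),
  total_attendance = c(100, 100, 50, 200)
)
school_days$rate <- school_days$total_faltas / school_days$total_attendance

# Variant 1: average the daily rates within each school
mean_of_rates <- tapply(school_days$rate, school_days$code_school, mean)

# Variant 2: pool counts first, then take the ratio
pooled_rate <- tapply(school_days$total_faltas,
                      school_days$code_school, sum) /
  tapply(school_days$total_attendance, school_days$code_school, sum)
```

The two variants agree for school A, whose daily totals are equal, and diverge for school B, whose daily totals differ. This is the case where the choice matters.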

[Figure: Distribution of Daily Absenteeism Rates by Education Level]

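The figure above presents the distribution of daily absenteeism rates by education level. Clear differences across modalities are observed. Consistent with the table of rates by education level, and comparing single-level categories, schools offering only elementary education (Ensino Fundamental) exhibit the lowest and most concentrated rates (median 0.076, interquartile range 0.051 to 0.109). Schools offering only early childhood education (Educação Infantil) sit noticeably higher (median 0.140), and secondary education (Ensino Médio) displays greater dispersion (interquartile range 0.082 to 0.178).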

Youth and Adult Education (EJA) shows systematically higher absenteeism levels and a distribution shifted toward larger values: EJA-only schools have a mean daily rate of 0.468, although this figure rests on only 815 school-day observations. This pattern is consistent with the specific attendance dynamics of this modality, which typically involves older students, flexible schedules, and different engagement constraints.

Combined education levels and specialized modalities present wider and less regular distributions, reflecting their heterogeneous institutional arrangements and limited temporal coverage. These differences reinforce the importance of distinguishing education levels when analyzing absenteeism patterns and motivate stratified or modality-specific analyses in subsequent stages.

Overall, the descriptive evidence presented in this section establishes key empirical regularities of school absenteeism at high temporal resolution. The observed levels and dispersion provide a factual baseline for the analysis of temporal patterns and institutional correlates explored in the next sections.

7 Temporal Patterns

To characterize the temporal dynamics of absenteeism, we first aggregate the data at different time scales: daily, monthly, and by day of the week.

We visualize these aggregated patterns to identify trends, seasonality, and weekly cycles.

Key temporal findings:

  • Absenteeism shows clear seasonality across months with recurring peaks and troughs.
  • Weekly cycles are evident, with systematic differences by day of week.
  • The smoothed daily series indicates gradual trend variation rather than abrupt breaks.

Cautions:

  • Changes in the number of active schools over time can mechanically shift aggregate rates.
  • Calendar effects (school holidays and administrative reporting gaps) may influence daily dynamics.
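
The monthly and day-of-week aggregations described in this section can be sketched as pooled ratios, as below. Pooling counts before dividing is a choice made here for illustration; it weights each student-day equally, so days with fewer active schools contribute less. The toy values and the helper name pooled_by are assumptions.

```r
daily <- data.frame(
  data = as.Date(c("2018-04-02", "2018-04-03", "2018-04-06",
                   "2018-05-07", "2018-05-11")),
  total_faltas     = c(12, 8, 20, 10, 25),
  total_attendance = c(100, 100, 100, 100, 100)
)

daily$month <- as.integer(format(daily$data, "%m"))
# as.POSIXlt()$wday is locale-independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
daily$wday <- as.POSIXlt(daily$data)$wday

# Pooled absenteeism rate within each level of a grouping column
pooled_by <- function(df, by) {
  num <- tapply(df$total_faltas, df[[by]], sum)
  den <- tapply(df$total_attendance, df[[by]], sum)
  num / den
}

by_month <- pooled_by(daily, "month")
by_wday  <- pooled_by(daily, "wday")
```

In this toy example, Fridays (wday 5) show a higher pooled rate than Mondays (wday 1). The values are invented and say nothing about the real weekly cycle.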

8 Spatial Patterns

The map below displays the spatial distribution of average absenteeism rates across schools. No spatial statistics are computed at this stage; the goal is to identify potential spatial clusters or gradients visually.

The point and grid maps are visually suggestive of spatial clustering, but this impression has not been tested. Formal spatial autocorrelation tests are deferred to later stages of the analysis.

9 Reproducibility notes

All results were generated with a fixed random seed (1234). Package versions and session details are recorded below.

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.utf8    
## 
## time zone: America/Toronto
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] geobr_1.9.1      scales_1.4.0     kableExtra_1.4.0 terra_1.8-93    
##  [5] maptiles_0.11.0  ggspatial_1.1.10 sf_1.0-24        janitor_2.2.1   
##  [9] skimr_2.2.2      lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0   
## [13] dplyr_1.2.0      purrr_1.2.1      readr_2.1.6      tidyr_1.3.2     
## [17] tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    viridisLite_0.4.3   farver_2.1.2       
##  [4] S7_0.2.1            fastmap_1.2.0       digest_0.6.39      
##  [7] timechange_0.4.0    lifecycle_1.0.5     magrittr_2.0.4     
## [10] compiler_4.5.2      rlang_1.1.7         sass_0.4.10        
## [13] tools_4.5.2         yaml_2.3.12         data.table_1.18.2.1
## [16] knitr_1.51          labeling_0.4.3      bit_4.6.0          
## [19] classInt_0.4-11     curl_7.0.0          xml2_1.5.2         
## [22] repr_1.1.7          RColorBrewer_1.1-3  abind_1.4-8        
## [25] KernSmooth_2.23-26  withr_3.0.2         grid_4.5.2         
## [28] e1071_1.7-17        cli_3.6.5           rmarkdown_2.30     
## [31] crayon_1.5.3        generics_0.1.4      otel_0.2.0         
## [34] rstudioapi_0.18.0   tzdb_0.5.0          DBI_1.2.3          
## [37] cachem_1.1.0        proxy_0.4-29        splines_4.5.2      
## [40] parallel_4.5.2      s2_1.1.9            base64enc_0.1-6    
## [43] vctrs_0.7.1         Matrix_1.7-4        jsonlite_2.0.0     
## [46] hms_1.1.4           bit64_4.6.0-1       systemfonts_1.3.1  
## [49] jquerylib_0.1.4     units_1.0-0         glue_1.8.0         
## [52] codetools_0.2-20    stringi_1.8.7       gtable_0.3.6       
## [55] pillar_1.11.1       htmltools_0.5.9     R6_2.6.1           
## [58] wk_0.9.5            textshaping_1.0.4   vroom_1.7.0        
## [61] evaluate_1.0.5      lattice_0.22-7      snakecase_0.11.1   
## [64] bslib_0.10.0        class_7.3-23        Rcpp_1.1.1         
## [67] svglite_2.2.2       nlme_3.1-168        mgcv_1.9-3         
## [70] xfun_0.56           fs_1.6.6            pkgconfig_2.0.3

10 Key decisions and next steps

10.1 Key decisions

  • Use a regular-panel definition of ≥ 200 observed days per school-year for longitudinal analyses, ensuring comparability and reducing bias driven by intermittent reporting.
  • Treat specialized or limited-coverage institutions (e.g., special education services, complementary programs, atypical calendars) as a separate analytical sample, to avoid conflating institutional heterogeneity with data-quality issues.
  • In model-based work, include explicit controls for calendar effects (e.g., day-of-week, month/term structure, holidays where available) and for school activity patterns (e.g., number of active days, enrollment/attendance volume), to reduce confounding from operational variation.
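
The regular-panel rule above can be applied as a simple flag on the school-year coverage table. In the sketch below, the first three rows reuse values from the low-coverage table in Section 5, and the fourth row is a hypothetical school-year added only to show the other branch; the label names are assumptions.

```r
min_days <- 200  # threshold stated in the key decisions

school_years <- data.frame(
  code_school = c("53000234", "53001737", "53007743", "HYPOTHETICAL"),
  year        = c(2018, 2017, 2018, 2018),
  n_days_obs  = c(161, 8, 1, 211)
)

school_years$sample_group <- ifelse(
  school_years$n_days_obs >= min_days,
  "regular_panel",
  "limited_coverage"
)

group_counts <- table(school_years$sample_group)
```

Because the flag is defined per school-year, the same school can fall in the regular panel in one year and in the limited-coverage sample in another. Analyses that require a school-level classification would need an additional rule.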

10.2 Next analytical steps (within the current dataset)

  • Finalize the temporal exploration (seasonality, day-of-week cycles, and within-year dynamics) stratified by education level and school type (regular vs specialized).
  • Define the primary outcome(s) to be modeled in later stages (e.g., daily absenteeism rate, log-odds transformation, or count models with offsets) and pre-specify inclusion rules (regular panel vs full sample).
  • Produce a concise set of “model-readiness” diagnostics: distributional checks, influence of extreme observations, and sensitivity to alternative rate definitions (e.g., excluding justified absences from the denominator, or counting them in the numerator as in the effective-absence measure of Section 3.1, if conceptually required).
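
Two of the ideas in this list can be illustrated briefly: a log-odds outcome that stays finite at rates of 0 or 1, and a side-by-side comparison of alternative rate definitions. The sketch uses the empirical logit with a 0.5 continuity correction, which is one common choice and not necessarily the transformation that will be adopted. The example counts come from the first row of the data preview in Section 4.1.

```r
# Empirical logit: adding 0.5 to both cells keeps the value finite
# when absences are 0 or equal to the total
empirical_logit <- function(absences, total) {
  log((absences + 0.5) / (total - absences + 0.5))
}

logits <- empirical_logit(c(0, 50, 100), c(100, 100, 100))

# Sensitivity of the rate to the treatment of justified absences
faltas <- 1335
presencas <- 770
justificadas <- 5

rates <- c(
  primary                     = faltas / (presencas + faltas + justificadas),
  no_justified_in_denominator = faltas / (presencas + faltas),
  effective                   = (faltas + justificadas) /
    (presencas + faltas + justificadas)
)
```

With so few justified absences on this school-day, the three definitions differ only slightly. Larger gaps would be expected where justified absences are common.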

10.3 Data enrichment opportunities for future research

Beyond the exploratory analysis, the dataset can be significantly enriched through linkage with external sources to support more substantive research questions. Potential extensions include:

  • Educational outcomes (IDEB / proficiency metrics): link school identifiers to IDEB (or other standardized performance measures) to study whether persistent absenteeism patterns correlate with learning outcomes and school performance profiles. This enables descriptive benchmarking and, later, quasi-experimental designs if policy timing or shocks can be identified.

  • Local epidemiological cycles (DATASUS): merge school-day observations with municipality/region-level indicators of respiratory infections or other relevant morbidity proxies (e.g., influenza-like illness, hospital admissions). This supports tests of whether absenteeism spikes track epidemiological dynamics, with appropriate lags and seasonal controls.

  • Air pollution and exposure proxies: integrate pollution measures (monitoring stations, modeled surfaces, or satellite-derived products when available). In the absence of direct pollution series, construct exposure proxies using roadway density and traffic-related indicators (e.g., kernel density of major roads) combined with meteorological controls, acknowledging measurement error explicitly.

  • Urban greenness (vegetation indices and land cover): attach school-level greenness measures (e.g., NDVI buffers, land-cover shares, canopy cover when available) to assess whether environmental amenities correlate with attendance patterns. This is particularly relevant for spatial heterogeneity and for studies of environmental inequality.

  • Thermal comfort and heat/dryness extremes (climate exposure): integrate meteorological data to capture heat stress and thermal discomfort as potential drivers of absenteeism. This can be operationalized using daily temperature (mean/max), humidity, and derived indices such as heat index / apparent temperature or UTCI (when inputs are available). In Brasília/DF, dryness and large diurnal ranges are also relevant; therefore, including relative humidity, precipitation, and drought proxies can help capture discomfort associated with low humidity and sustained dry spells. These linkages enable analyses of short-run effects (same-day and lag structures), threshold behavior (extreme heat days), and interactions with school type and urban exposure (e.g., greenness and road density).

These enrichments would allow the transition from descriptive exploration to hypothesis-driven analyses, while preserving transparency about data linkage assumptions, spatial/temporal alignment, and potential confounding.