School absenteeism is a central concern in educational research and public policy, as it is closely associated with learning outcomes, student engagement, and longer-term socioeconomic trajectories. Understanding its temporal dynamics and institutional heterogeneity is particularly relevant in large urban education systems, where schools operate under diverse contextual and organizational conditions.
This report presents an initial exploratory analysis of a large administrative dataset covering daily attendance records for public schools in the Federal District of Brazil. The purpose of this document is to assess the structure, coverage, and internal consistency of the data, as well as to provide preliminary descriptive evidence on patterns of absenteeism across schools and over time.
The analysis conducted here is intentionally descriptive and diagnostic in nature. No causal claims are made, and no multivariate modeling is attempted at this stage. Instead, the focus is on identifying key data features, potential quality issues, and broad empirical regularities that will inform subsequent data-cleaning procedures and the design of future analytical models.
This exploratory step is intended to support methodological decisions in the next phases of the research, including variable construction, sample definition, and the choice of appropriate empirical strategies.
The dataset is an administrative extract of daily attendance records for public schools in the Federal District of Brazil. The unit of analysis is the school-day. The table below summarizes coverage.
| Start Date | End Date | Unique Days | Unique Schools | Total Observations |
|---|---|---|---|---|
| 2016-01-01 | 2019-12-31 | 1,461 | 632 | 498,677 |
Absenteeism rate, computed at the school-day level: total_faltas / (total_presencas + total_faltas + total_justificadas).
For comparability across schools with different reporting practices, justified absences are kept in the denominator, since they remain part of the recorded attendance accounting and define total observed student-days.
For economic and epidemiological interpretations, an alternative “effective absence” measure can also be defined by treating justified absences as non-attendance: absent_effective = (total_faltas + total_justificadas) / (total_presencas + total_faltas + total_justificadas). This alternative aligns more closely with interpretations related to learning time loss and illness-related absenteeism; however, it is not explored in this report and is reserved for subsequent model-based and robustness analyses.
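The two definitions above can be compared with a minimal sketch. It is written in plain Python for illustration only, not in the report's own R pipeline, and the function name `absenteeism_rates` is hypothetical. The counts are taken from the first sample row of the cleaned data shown later in this report.

```python
def absenteeism_rates(total_faltas, total_presencas, total_justificadas):
    """Return (absenteeism rate, absent_effective) for one school-day."""
    denominator = total_presencas + total_faltas + total_justificadas
    if denominator == 0:
        # No recorded student-days: both rates are undefined.
        return None, None
    rate = total_faltas / denominator
    # Alternative measure: justified absences count as non-attendance.
    effective = (total_faltas + total_justificadas) / denominator
    return rate, effective


# Counts from one sample school-day (1335 absences, 770 presences, 5 justified).
rate, effective = absenteeism_rates(1335, 770, 5)
print(round(rate, 4), round(effective, 4))  # 0.6327 0.6351
```

Because justified absences stay in the denominator in both cases, the effective measure can never be lower than the baseline rate.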
Sample handling distinguishes regular panel coverage from specialized or limited-coverage institutions, as detailed in later sections.
The dataset analyzed in this report consists of a large administrative panel with daily attendance information at the school level for public schools in the Federal District of Brazil. Each observation corresponds to a specific school-day combination, allowing the analysis of both temporal variation within schools and cross-sectional heterogeneity across institutions.
In total, the dataset contains 498,677 observations and 23 variables. The variables can be grouped into four broad categories: (i) school identifiers and institutional characteristics, (ii) educational and administrative classifications, (iii) attendance-related counts, and (iv) temporal and spatial information. Most institutional variables are constant within schools, while attendance measures vary daily.
From a structural perspective, the unit of observation and the panel nature of the data are well defined. However, several data-quality issues are immediately apparent and are documented below, as they have direct implications for subsequent data preparation and modeling decisions.

| Variable | Type | Missing |
|---|---|---|
| ...1 | numeric | 0 |
| code_school | numeric | 542 |
| name_school | character | 542 |
| education_level | character | 542 |
| education_level_others | character | 151308 |
| admin_category | character | 542 |
| address | character | 542 |
| government_level | character | 542 |
| regulated_education_council | character | 542 |
| service_restriction | character | 542 |
| size | character | 542 |
| urban | character | 542 |
| location_type | character | 542 |
| date_update | Date | 542 |
| dia_data | Date | 0 |
| cep_escola | numeric | 2724 |
| count | numeric | 0 |
| ano_nascimento | numeric | 542 |
| total_faltas | numeric | 0 |
| total_presencas | numeric | 0 |
| total_justificadas | numeric | 0 |
| data | Date | 0 |
| geom | character | 0 |
The school identifier (code_school) is currently stored
as a numeric variable. Given its role as a categorical identifier rather
than a quantitative measure, this variable should be converted to
character format to avoid precision loss and unintended behavior in
joins or grouping operations. Additionally, an index-like variable
(...1) appears to be a residual artifact from data import
procedures and does not carry analytical relevance.
The dataset includes two date variables (dia_data and
data), both stored as Date objects. Their
coexistence suggests redundancy or inconsistent naming across data
sources. A single reference date variable will need to be selected after
verifying their equivalence.
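
One way to verify the equivalence of the two date variables is to list the rows where they disagree. The sketch below is in plain Python rather than the report's R workflow; the helper name `date_mismatches` and the second example row (a deliberate mismatch) are hypothetical.

```python
from datetime import date


def date_mismatches(rows):
    """Return the rows whose dia_data and data fields differ."""
    return [r for r in rows if r["dia_data"] != r["data"]]


rows = [
    {"code_school": "53000234", "dia_data": date(2018, 4, 3), "data": date(2018, 4, 3)},
    # Hypothetical inconsistent record, included only to exercise the check.
    {"code_school": "53000234", "dia_data": date(2018, 4, 4), "data": date(2018, 4, 5)},
]
mismatches = date_mismatches(rows)
print(len(mismatches), "of", len(rows), "rows disagree")
```

If the list is empty for the full dataset, either column can serve as the reference date and the other can be dropped.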
A more serious issue is observed in the variable labeled
ano_nascimento, which contains non-integer values (for example,
1996.365) that are incompatible with valid year formats. This strongly
suggests data corruption, misclassification, or the unintended
inclusion of an anonymized identifier. Given its current state, this
variable is excluded from the initial exploratory analysis until
further clarification is obtained.
Attendance-related variables (total_presencas,
total_faltas, and total_justificadas) are well
defined as numeric counts and form the analytical core of this study.
These variables are conceptually consistent and vary at the school-day
level, making them suitable for both descriptive analysis and subsequent
modeling of absenteeism dynamics.
An additional count variable (count) appears to
represent the total number of enrolled or expected students per
school-day, although its precise definition will require confirmation.
Together, these variables enable the construction of derived indicators
such as absenteeism rates and adjusted attendance measures.
Spatial information is provided through a geometry variable
(geom), currently stored as a character string containing
coordinate pairs. In its present format, this variable is not directly
compatible with standard spatial analysis workflows in R (e.g.,
sf objects). Spatial processing is therefore deferred to
later stages, following appropriate parsing and validation of the
coordinate information.
Before proceeding to descriptive analyses, a structured data-cleaning and preprocessing step was conducted to establish a coherent and analytically reliable baseline. Given the administrative origin and large scale of the dataset, this process prioritized clarity of identifiers, consistency of attendance measures, consolidation of temporal information, and validation of spatial metadata.
The purpose of this section is not to document all possible future transformations, but to explicitly describe the initial decisions required to make the dataset suitable for systematic exploration and subsequent modeling. All transformations applied at this stage are transparent, limited in scope, and reversible.
| code_school | name_school | education_level | education_level_others | admin_category | address | government_level | regulated_education_council | service_restriction | size | urban | location_type | dia_data | cep_escola | count | ano_nascimento | total_faltas | total_presencas | total_justificadas | data | geom | total_attendance | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53000234 | CEJA ASA SUL - CESAS | Educação de Jovens Adultos | Atendimento Educacional Especializado | Pública | QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. | Estadual | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização diferenciada | 2018-04-03 | 70200620 | 2110 | 1996.365 | 1335 | 770 | 5 | 2018-04-03 | c(-47.8821157, -15.8094245) | 2110 | -47.88212 | -15.80942 |
| 53000234 | CEJA ASA SUL - CESAS | Educação de Jovens Adultos | Atendimento Educacional Especializado | Pública | QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. | Estadual | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização diferenciada | 2018-04-04 | 70200620 | 1480 | 1993.818 | 880 | 585 | 15 | 2018-04-04 | c(-47.8821157, -15.8094245) | 1480 | -47.88212 | -15.80942 |
| 53000234 | CEJA ASA SUL - CESAS | Educação de Jovens Adultos | Atendimento Educacional Especializado | Pública | QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. | Estadual | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização diferenciada | 2018-04-06 | 70200620 | 1675 | 1994.179 | 1130 | 535 | 10 | 2018-04-06 | c(-47.8821157, -15.8094245) | 1675 | -47.88212 | -15.80942 |
| 53000234 | CEJA ASA SUL - CESAS | Educação de Jovens Adultos | Atendimento Educacional Especializado | Pública | QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. | Estadual | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização diferenciada | 2018-04-10 | 70200620 | 2120 | 1996.387 | 1370 | 735 | 15 | 2018-04-10 | c(-47.8821157, -15.8094245) | 2120 | -47.88212 | -15.80942 |
| 53000234 | CEJA ASA SUL - CESAS | Educação de Jovens Adultos | Atendimento Educacional Especializado | Pública | QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. | Estadual | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização diferenciada | 2018-04-11 | 70200620 | 1490 | 1993.732 | 815 | 640 | 35 | 2018-04-11 | c(-47.8821157, -15.8094245) | 1490 | -47.88212 | -15.80942 |
| 53000234 | CEJA ASA SUL - CESAS | Educação de Jovens Adultos | Atendimento Educacional Especializado | Pública | QUADRA SGAS 602, PROJECAO D. ASA SUL. 70200-620 Brasília - DF. | Estadual | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATENDIMENTO | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização diferenciada | 2018-04-13 | 70200620 | 1715 | 1994.157 | 1130 | 555 | 30 | 2018-04-13 | c(-47.8821157, -15.8094245) | 1715 | -47.88212 | -15.80942 |
At this stage, non-informative artifacts generated during data import were removed, and key identifiers were standardized to character format to avoid numerical precision issues. A derived attendance measure was constructed to facilitate internal consistency checks between reported presences, absences, and total counts.
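The steps described above can be sketched as follows, in plain Python rather than the report's R pipeline. The function name `clean_row` is hypothetical. The sketch assumes that the derived measure (total_attendance in the sample table) is the sum of presences, absences, and justified absences, which matches the sample rows but is not stated explicitly.

```python
def clean_row(raw):
    """Drop the import artifact, standardize the identifier, derive total_attendance."""
    # Remove the index-like column created during import.
    row = {key: value for key, value in raw.items() if key != "...1"}
    # Store the school code as text so it is never treated as a quantity.
    code = row.get("code_school")
    row["code_school"] = None if code is None else str(int(code))
    # Assumed definition of the derived measure: sum of the three daily counts.
    row["total_attendance"] = (
        row["total_presencas"] + row["total_faltas"] + row["total_justificadas"]
    )
    return row


raw = {
    "...1": 1,
    "code_school": 53000234.0,
    "count": 2110,
    "total_faltas": 1335,
    "total_presencas": 770,
    "total_justificadas": 5,
}
cleaned = clean_row(raw)
# Internal consistency check against the reported total.
is_consistent = cleaned["total_attendance"] == cleaned["count"]
```

The original dictionary is left untouched, which keeps the transformation reversible in the sense described above.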
Spatial information was initially provided as a character string containing coordinate pairs. These values were parsed into numeric longitude and latitude variables, enabling basic validation while postponing full spatial object construction to a later analytical stage.
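
A minimal sketch of this parsing step is shown below, in plain Python rather than the report's R workflow. It assumes the strings follow the `c(longitude, latitude)` pattern visible in the sample rows, with `NaN` or `NA` marking missing values; the function name `parse_geom` is hypothetical.

```python
import math
import re

# Matches strings such as "c(-47.8821157, -15.8094245)".
_GEOM_PATTERN = re.compile(r"^c\(\s*([^,\s]+)\s*,\s*([^,\s)]+)\s*\)$")


def _to_float(token):
    """Convert one coordinate token to float, or None if missing or invalid."""
    if token in ("NA", "NaN"):
        return None
    try:
        value = float(token)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def parse_geom(geom):
    """Return (longitude, latitude), or (None, None) when either is missing."""
    match = _GEOM_PATTERN.match(geom.strip()) if isinstance(geom, str) else None
    if match is None:
        return None, None
    lon, lat = _to_float(match.group(1)), _to_float(match.group(2))
    if lon is None or lat is None:
        return None, None
    return lon, lat


print(parse_geom("c(-47.8821157, -15.8094245)"))  # (-47.8821157, -15.8094245)
print(parse_geom("c(NaN, NaN)"))                  # (None, None)
```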
| code_school | cep_escola | geom |
|---|---|---|
| 53005341 | 73030050 | c(NaN, NaN) |
| 53005767 | 73151010 | c(NaN, NaN) |
| 53008944 | 72301700 | c(NaN, NaN) |
| 53008979 | 72669425 | c(NaN, NaN) |
| 53011066 | 72600412 | c(NaN, NaN) |
| 53012127 | 72631115 | c(NaN, NaN) |
| 53012666 | 73307994 | c(NaN, NaN) |
| NA | NA | c(NA, NA) |
Seven schools presented missing spatial coordinates after the initial parsing step, as listed in the table above; the remaining entry corresponds to records with no school identifier. For the seven identified schools, latitude and longitude values were recovered using externally verified school-level location information.
This imputation procedure was limited to a clearly identified and small number of institutions and does not rely on statistical interpolation. Instead, it represents a deterministic correction of missing metadata based on external sources. All imputed values are explicitly documented and reproducible.
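A deterministic correction of this kind amounts to a lookup by school code. The sketch below, in plain Python rather than the report's R pipeline, illustrates the idea. The school code, the coordinates in `lookup`, and the `coords_imputed` flag are placeholders invented for the example, not values used in the report.

```python
def impute_coordinates(row, lookup):
    """Fill missing coordinates from a lookup table and flag the imputed rows."""
    out = dict(row)
    out["coords_imputed"] = False
    if out.get("longitude") is None or out.get("latitude") is None:
        fix = lookup.get(out.get("code_school"))
        if fix is not None:
            out["longitude"], out["latitude"] = fix
            out["coords_imputed"] = True
    return out


# Placeholder lookup: hypothetical school code and coordinates.
lookup = {"00000001": (-47.9, -15.8)}

missing = {"code_school": "00000001", "longitude": None, "latitude": None}
unknown = {"code_school": None, "longitude": None, "latitude": None}

fixed = impute_coordinates(missing, lookup)
still_missing = impute_coordinates(unknown, lookup)
```

Rows without a school identifier cannot be matched and keep their missing coordinates, so they remain subject to any later coordinate filter.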
Finally, the cleaned dataset was converted into a spatial object using a standard geographic coordinate reference system (WGS 84). This transformation enables future spatial analyses while preserving the original coordinate variables for transparency and compatibility with non-spatial workflows.
Overall, the resulting dataset provides a consistent and well-documented foundation for exploratory analysis of school absenteeism at high temporal resolution. The cleaning steps undertaken here explicitly address identifiable data-quality issues while maintaining a cautious and modular approach. Subsequent sections build on this baseline to examine temporal coverage and descriptive patterns, with these preprocessing decisions kept explicitly in view.
The checks below formalize basic invariants and summarize the impact of cleaning decisions. Each item reports the count and share of affected rows.
| Check | Affected Rows | Percent | Action |
|---|---|---|---|
| Rows in raw dataset | 498,677 | 100% | Baseline |
| Rows after cleaning | 498,135 | 99.89% | Derived after cleaning |
| Rows with missing coordinates (pre-imputation) | 8 | <0.01% | Imputed deterministically |
| Rows imputed (coordinates) | 7 | <0.01% | Imputed deterministically |
| Rows dropped after coordinate filter | 542 | 0.11% | Dropped if still missing |
| Attendance identity mismatch | 0 | 0% | Flagged for review |
| Non-negative attendance violations | 0 | 0% | Flagged for review |
| Absenteeism rate outside [0, 1] | 0 | 0% | Flagged for review |
| Duplicate key rows | 4,666 | 0.94% | Flagged for review |
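
The invariants in the table can be expressed as simple row-level checks. The sketch below is in plain Python rather than the report's R pipeline and rests on several assumptions: the attendance identity is taken to mean that the three daily counts sum to `count`, the key is taken to be (code_school, data), and every row sharing a key with another row is counted as a duplicate. The function name `run_checks` is hypothetical, as are the last two example rows.

```python
from collections import Counter

COUNT_FIELDS = ("total_presencas", "total_faltas", "total_justificadas")


def run_checks(rows):
    """Return {check name: (affected rows, percent of all rows)}."""
    n = len(rows)
    key_freq = Counter((r["code_school"], r["data"]) for r in rows)

    def rate(r):
        total = sum(r[f] for f in COUNT_FIELDS)
        return r["total_faltas"] / total if total > 0 else None

    affected = {
        # Assumed identity: the three counts add up to the reported total.
        "identity_mismatch": sum(
            1 for r in rows if sum(r[f] for f in COUNT_FIELDS) != r["count"]
        ),
        "negative_counts": sum(
            1 for r in rows if any(r[f] < 0 for f in COUNT_FIELDS)
        ),
        "rate_out_of_range": sum(
            1 for r in rows if rate(r) is not None and not 0 <= rate(r) <= 1
        ),
        "duplicate_key_rows": sum(
            1 for r in rows if key_freq[(r["code_school"], r["data"])] > 1
        ),
    }
    return {name: (k, round(100 * k / n, 2) if n else 0.0) for name, k in affected.items()}


rows = [
    {"code_school": "53000234", "data": "2018-04-03", "count": 2110,
     "total_presencas": 770, "total_faltas": 1335, "total_justificadas": 5},
    {"code_school": "53000234", "data": "2018-04-04", "count": 1480,
     "total_presencas": 585, "total_faltas": 880, "total_justificadas": 15},
    # Hypothetical duplicate of the previous key.
    {"code_school": "53000234", "data": "2018-04-04", "count": 1480,
     "total_presencas": 585, "total_faltas": 880, "total_justificadas": 15},
    # Hypothetical row whose counts do not add up to the reported total.
    {"code_school": "53000234", "data": "2018-04-06", "count": 100,
     "total_presencas": 50, "total_faltas": 40, "total_justificadas": 5},
]
report = run_checks(rows)
```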
This section describes the effective temporal and institutional coverage of the cleaned dataset. The objective is to document the observation window, the number of schools represented, and the continuity of observations over time, without introducing substantive interpretation of outcomes.
Understanding these dimensions is essential to contextualize subsequent descriptive results and to assess whether the data structure supports different empirical strategies in later stages.

| Start Date | End Date | Unique Days | Unique Schools | Total Observations |
|---|---|---|---|---|
| 2016-01-04 | 2019-12-20 | 1,011 | 631 | 498,135 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1 | 204 | 211 | 212.98 | 221 | 251 |
| Percentile | Days Observed |
|---|---|
| 5% | 200 |
| 25% | 204 |
| 50% | 211 |
| 75% | 221 |
| 95% | 235 |
Most schools exhibit substantial temporal coverage. The median school-year contains 211 observed days, a large fraction of the roughly 250 distinct dates recorded per year (1,011 unique days over the four-year window). A smaller subset of schools presents more limited coverage, which will require attention in subsequent analyses, particularly when constructing longitudinal comparisons or defining analytical samples.
The panel is nevertheless not perfectly balanced. While many schools are observed for a large share of the total period, others enter or exit the dataset at different points in time. This unbalanced structure is typical of administrative education data and reflects institutional openings, closures, reporting gaps, or changes in operational status.
Within a given year, coverage is high for most institutions: 90% of school-years contain between 200 and 235 observed days (5th and 95th percentiles), suggesting high temporal continuity for the majority of schools in the dataset.
A small number of schools exhibit markedly lower numbers of observed days. Closer inspection reveals that these cases are not random reporting failures, but are predominantly associated with institutions providing specialized or complementary educational services, such as special education and targeted support programs. These units typically operate under distinct organizational arrangements and do not follow the standard daily attendance structure of regular schools.
As a result, the presence of schools with limited temporal coverage reflects institutional heterogeneity rather than data-quality deficiencies. This distinction is important for subsequent analyses, as it motivates differentiated treatment of school types and careful definition of analytical samples when constructing longitudinal comparisons or aggregate indicators.
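The coverage measure discussed above, observed days per school-year, can be sketched as a count of distinct dates per school and calendar year. The example is in plain Python rather than the report's R pipeline. The function names, the school codes "A" and "B", and the `min_days` threshold are hypothetical; the report does not state the cutoff used to define limited coverage.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def days_observed(rows):
    """Count distinct observed dates per (school, year)."""
    seen = defaultdict(set)
    for code_school, day in rows:
        seen[(code_school, day.year)].add(day)
    return {key: len(days) for key, days in seen.items()}


def limited_coverage(days_by_school_year, min_days):
    """Return the school-years observed on fewer than min_days dates."""
    return sorted(k for k, n in days_by_school_year.items() if n < min_days)


# Toy records: (school code, date). Duplicate dates are counted once.
rows = [
    ("A", date(2016, 2, 29)),
    ("A", date(2016, 3, 1)),
    ("A", date(2016, 3, 1)),
    ("A", date(2017, 2, 10)),
    ("B", date(2016, 2, 29)),
]
coverage = days_observed(rows)
typical = median(coverage.values())
flagged = limited_coverage(coverage, min_days=2)  # illustrative threshold only
```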
| year(data) | code_school | education_level | education_level_others | n_days_obs | first_date | last_date |
|---|---|---|---|---|---|---|
| 2018 | 53000234 | Educação de Jovens Adultos | Atendimento Educacional Especializado | 161 | 2018-02-16 | 2018-12-21 |
| 2016 | 53001451 | Ensino Fundamental | Atendimento Educacional Especializado | 171 | 2016-02-29 | 2016-12-23 |
| 2016 | 53001486 | Ensino Fundamental | Atendimento Educacional Especializado | 163 | 2016-02-29 | 2016-12-23 |
| 2016 | 53001516 | Ensino Fundamental | Atendimento Educacional Especializado | 161 | 2016-02-29 | 2016-12-23 |
| 2016 | 53001540 | Ensino Fundamental | NA | 180 | 2016-02-29 | 2016-12-28 |
| 2016 | 53001630 | Ensino Fundamental | Atendimento Educacional Especializado | 163 | 2016-02-29 | 2016-12-21 |
| 2017 | 53001737 | Ensino Fundamental | Atendimento Educacional Especializado | 8 | 2017-02-10 | 2017-02-21 |
| 2016 | 53002270 | Educação Infantil | NA | 2 | 2016-02-29 | 2016-03-01 |
| 2018 | 53007743 | Ensino Fundamental | NA | 1 | 2018-02-15 | 2018-02-15 |
| 2016 | 53008839 | Ensino Fundamental | NA | 8 | 2016-02-29 | 2016-03-09 |
| 2016 | 53008847 | Ensino Fundamental | Atendimento Educacional Especializado | 171 | 2016-02-29 | 2016-12-23 |
| 2016 | 53012798 | Ensino Fundamental | NA | 169 | 2016-02-29 | 2016-12-22 |
| 2016 | 53013557 | Ensino Fundamental | Atendimento Educacional Especializado | 3 | 2016-02-29 | 2016-03-02 |
| 2017 | 53017277 | Educação Infantil | NA | 15 | 2017-05-02 | 2017-07-03 |
| 2017 | 53017323 | Educação Infantil | NA | 168 | 2017-04-03 | 2017-12-21 |
| 2017 | 53068165 | Ensino Fundamental | Atendimento Educacional Especializado, Atividade Complementar | 1 | 2017-02-10 | 2017-02-10 |
| Percentile | Active Schools |
|---|---|
| 5% | 53.5 |
| 25% | 516.0 |
| 50% | 562.0 |
| 75% | 602.0 |
| 95% | 627.0 |
Coverage diagnostics: active schools by day and observed days per school-year
The number of active schools varies over time, indicating changes in institutional participation or reporting. This variation further reinforces the unbalanced nature of the panel and motivates careful consideration of sample definitions in later stages of the analysis.
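The daily count of active schools can be sketched as the number of distinct school codes reporting on each date. The example is in plain Python rather than the report's R pipeline; the function name and the toy school codes are hypothetical.

```python
from collections import defaultdict
from datetime import date


def active_schools_by_day(rows):
    """Return {date: number of distinct schools reporting}, ordered by date."""
    schools = defaultdict(set)
    for code_school, day in rows:
        schools[day].add(code_school)
    return {day: len(codes) for day, codes in sorted(schools.items())}


# Toy records: (school code, date).
rows = [
    ("A", date(2016, 2, 29)),
    ("B", date(2016, 2, 29)),
    ("B", date(2016, 2, 29)),  # repeated key, counted once
    ("A", date(2016, 3, 1)),
]
active = active_schools_by_day(rows)
```

Using a set per date keeps duplicate key rows from inflating the count.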
Overall, the dataset provides broad temporal coverage and includes a large number of schools observed at high frequency. While the panel is unbalanced, its structure is consistent with the characteristics of large-scale administrative education data. These features are taken into account in the descriptive analyses that follow and will inform modeling choices in subsequent phases of the research.
This section presents an initial descriptive characterization of school absenteeism using the cleaned dataset. The objective is to establish baseline levels, dispersion, and heterogeneity across schools, without introducing causal interpretation or model-based inference.
Given the institutional heterogeneity documented in the previous section, descriptive results are presented for the full sample and, when relevant, with explicit reference to differences between regular schools and institutions providing specialized or complementary educational services.
Absenteeism is measured as the ratio between total absences and total recorded attendance (presences, absences, and justified absences) on each school-day. This measure provides a scale-free indicator that facilitates comparison across institutions of different sizes.

| Sample Group | Schools | Median Days per Year | Mean Absenteeism | P25 | P75 |
|---|---|---|---|---|---|
| Regular panel | 631 | 211 | 10.7% | 10.7% | 10.7% |
| Specialized/limited coverage | 16 | 161 | 15.1% | 15.1% | 15.1% |
| Mean | Median | Std. Dev. | 25th Percentile | 75th Percentile |
|---|---|---|---|---|
| 0.108 | 0.0946 | 0.0744 | 0.0617 | 0.1392 |
At the aggregate level, absenteeism rates exhibit substantial dispersion, indicating marked heterogeneity in attendance behavior across schools and over time. The median (0.095) lies below the mean (0.108), consistent with a distribution skewed by extreme observations, and therefore provides a more robust benchmark.
The distribution of absenteeism rates is right-skewed, with a
concentration of observations at relatively low levels and a long upper
tail. This pattern is consistent with episodic spikes in absences,
potentially driven by institutional, temporal, or contextual
factors.
| education_level | n | mean | median | p25 | p75 |
|---|---|---|---|---|---|
| Educação Infantil | 47730 | 0.1536 | 0.1402 | 0.1085 | 0.1805 |
| Educação Infantil, Ensino Fundamental | 140149 | 0.0956 | 0.0865 | 0.0604 | 0.1176 |
| Educação Infantil, Ensino Fundamental, Educação de Jovens Adultos | 7031 | 0.1295 | 0.1111 | 0.0717 | 0.1710 |
| Educação Infantil, Ensino Fundamental, Ensino Médio | 1729 | 0.0555 | 0.0532 | 0.0295 | 0.0751 |
| Educação Infantil, Ensino Fundamental, Ensino Médio, Educação de Jovens Adultos | 4132 | 0.1257 | 0.1182 | 0.0800 | 0.1641 |
| Educação de Jovens Adultos | 815 | 0.4680 | 0.4804 | 0.4038 | 0.5382 |
| Ensino Fundamental | 177589 | 0.0867 | 0.0764 | 0.0509 | 0.1085 |
| Ensino Fundamental, Educação de Jovens Adultos | 43992 | 0.1289 | 0.1226 | 0.0793 | 0.1694 |
| Ensino Fundamental, Ensino Médio | 14133 | 0.1226 | 0.1144 | 0.0754 | 0.1590 |
| Ensino Fundamental, Ensino Médio, Educação Profissional | 894 | 0.0904 | 0.0891 | 0.0579 | 0.1185 |
| Ensino Fundamental, Ensino Médio, Educação de Jovens Adultos | 18845 | 0.1355 | 0.1291 | 0.0877 | 0.1788 |
| Ensino Médio | 23538 | 0.1335 | 0.1343 | 0.0820 | 0.1781 |
| Ensino Médio, Educação Profissional | 1299 | 0.0757 | 0.0588 | 0.0218 | 0.1010 |
| Ensino Médio, Educação de Jovens Adultos | 16259 | 0.1528 | 0.1463 | 0.0952 | 0.1996 |
| education_level_others | n | mean | median | p25 | p75 |
|---|---|---|---|---|---|
| Atendimento Educacional Especializado | 211721 | 0.1057 | 0.0912 | 0.0598 | 0.1358 |
| Atendimento Educacional Especializado, Atividade Complementar | 91572 | 0.1051 | 0.0945 | 0.0623 | 0.1359 |
| Atividade Complementar | 44076 | 0.1104 | 0.0957 | 0.0615 | 0.1422 |
| NA | 150766 | 0.1122 | 0.0992 | 0.0645 | 0.1447 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.0334 | 0.0811 | 0.0993 | 0.1088 | 0.1295 | 0.5104 |
When aggregating absenteeism rates at the school level, substantial heterogeneity becomes evident. While many schools exhibit relatively low average absenteeism, a non-negligible subset displays persistently higher levels, reinforcing the importance of accounting for institutional differences in subsequent analyses. The dispersion observed across schools highlights that absenteeism is not uniformly distributed within the education system. This heterogeneity motivates stratified analyses and the exploration of covariates that may explain persistent differences across institutions.
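A school-level aggregation can be sketched as follows, in plain Python rather than the report's R pipeline. The sketch assumes that each school's value is the unweighted mean of its daily rates; a pooled ratio of total absences to total student-days is another reasonable choice and would weight larger days more heavily. The function name and the toy school codes and counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def school_mean_rates(rows):
    """Return {school code: mean of daily absenteeism rates}."""
    daily = defaultdict(list)
    for code_school, faltas, presencas, justificadas in rows:
        total = presencas + faltas + justificadas
        if total > 0:  # skip school-days with no recorded student-days
            daily[code_school].append(faltas / total)
    return {code: mean(rates) for code, rates in daily.items()}


# Toy records: (school, total_faltas, total_presencas, total_justificadas).
rows = [
    ("A", 10, 90, 0),
    ("A", 30, 70, 0),
    ("B", 5, 95, 0),
    ("B", 0, 0, 0),  # empty day, ignored
]
by_school = school_mean_rates(rows)
```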
Distribution of Daily Absenteeism Rates by Education Level
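The figure above presents the distribution of daily absenteeism rates by education level. Clear differences across modalities are observed. Among single-level schools, primary education (Ensino Fundamental) exhibits the lowest and most concentrated rates (mean 0.087, interquartile range 0.051 to 0.109). Early childhood education (mean 0.154) and secondary education (mean 0.134) show higher levels, and secondary education displays the greatest dispersion (interquartile range 0.082 to 0.178).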
Youth and Adult Education (EJA) shows systematically higher absenteeism levels and a distribution shifted toward larger values, with a mean daily rate of about 0.47 in EJA-only schools. This pattern is consistent with the specific attendance dynamics of this modality, which typically involves older students, flexible schedules, and different engagement constraints.
Combined education levels and specialized modalities present wider and less regular distributions, reflecting their heterogeneous institutional arrangements and limited temporal coverage. These differences reinforce the importance of distinguishing education levels when analyzing absenteeism patterns and motivate stratified or modality-specific analyses in subsequent stages.
Overall, the descriptive evidence presented in this section establishes key empirical regularities of school absenteeism at high temporal resolution. The observed levels and dispersion provide a factual baseline for the analysis of temporal patterns and institutional correlates explored in the next sections.
To characterize the temporal dynamics of absenteeism, we first aggregate the data at different time scales: daily, monthly, and by day of the week.
We visualize these aggregated patterns to identify trends, seasonality, and weekly cycles.
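The three aggregation scales can be sketched with a single grouping helper, in plain Python rather than the report's R pipeline. The sketch assumes each group is summarized by the unweighted mean of school-day rates. The function name `aggregate_rates` and the toy rates are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def aggregate_rates(rows, key):
    """Average school-day rates within groups defined by key(date)."""
    groups = defaultdict(list)
    for day, rate in rows:
        groups[key(day)].append(rate)
    return {k: mean(v) for k, v in sorted(groups.items())}


# Toy records: (date, absenteeism rate for one school-day).
rows = [
    (date(2018, 4, 3), 0.10),
    (date(2018, 4, 3), 0.20),
    (date(2018, 4, 10), 0.30),
    (date(2018, 5, 2), 0.40),
]
by_day = aggregate_rates(rows, lambda d: d)
by_month = aggregate_rates(rows, lambda d: (d.year, d.month))
by_weekday = aggregate_rates(rows, lambda d: d.weekday())  # Monday = 0
```

Because the number of reporting schools changes from day to day, an unweighted mean and a pooled ratio can diverge, so the choice should be stated alongside any plotted series.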
Key temporal findings:

- Absenteeism shows clear seasonality across months with recurring peaks and troughs.
- Weekly cycles are evident, with systematic differences by day of week.
- The smoothed daily series indicates gradual trend variation rather than abrupt breaks.

Cautions:

- Changes in the number of active schools over time can mechanically shift aggregate rates.
- Calendar effects (school holidays and administrative reporting gaps) may influence daily dynamics.
The map below displays the spatial distribution of average absenteeism rates across schools. No spatial statistics are computed at this stage; the goal is to identify potential spatial clusters or gradients visually.