Introduction

Harassment and bullying in schools, is a significant public health and educational equity concern.
The U.S. Department of Education’s Civil Rights Data Collection (CRDC) gathers this data nationally every two years.
In the United States, schools are legally obligated to report allegations of harassment or bullying to responsible school employees under civil rights law.
Understanding which types of harassment are most prevalent and whether reporting rates differ across regions is essential for informing policy and resource allocation.

Problem Statement

Research Questions:

Descriptive: How are harassment and bullying allegations distributed across U.S. states and allegation types in 2015–16?
Hypothesis Testing: do allegations per 1,000 schools differ significantly across U.S. geographi regions (Northeast, south, midwest, west)?
Regression Analysis: Can the number of schools in a state predict the total number of harassment allegations reported?

Rationale:

If regional differences exist, it can be said that this reflects differences in reporting culture, legal enforcement, or school climate.
Identifying predictors of total allegations helps educators and policymakers target preventative interventions more effectively.

Data

Source:

U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection, 2015–16. http://ocrdata.ed.gov

Sampling:

Data represent census-level collection: all public schools in the United States were required to report
99.5% of schools responded; one jurisdiction (District of Columbia) had >15% missing data and is flagged for caution
Unit of analysis: U.S. state (n = 51, including D.C.)

Data: Variables

Variable	Type	Description
State	Categorical	U.S. state or territory
Total	Numeric	Total allegations reported
Sex	Categorical	Allegations based on sex
Race	Numeric	Allegations based on race/colour/national origin
Disability	Numeric	Allegations based on disability
SexualOrientation	Numeric	Allegations based on sexual orientation
Religion	Numeric	Allegations based on religion
NumSchools	Numeric	Number of schools in state
PctSchools	Numeric	% of schools reporting

Derived variables created:

AllegationsPer1000Schools: Total allegations ÷ number of schools × 1,000
Region: U.S. Census Bureau geographic region (northeast, south, midwest, west)

Data: Pre-processing

library(readxl)

# Load the data, skip top metadata rows
raw <- read_excel(
  "Allegations-of-Harassment-or-Bullying.xlsx",
  skip = 4,
  col_names = c("drop", "State", "Total", "Sex", "Race",
                "Disability", "SexualOrientation", "Religion",
                "NumSchools", "PctSchools", "Flag")
)

# Keep only the 51 state rows; convert numeric columns
df <- raw %>%
  select(-drop, -Flag) %>%
  filter(!is.na(Total), State != "United States",
         !grepl("^NOTE|^Data|^SOURCE", State)) %>%
  mutate(across(-State, as.numeric))

# Add rate per 1,000 schools and U.S. Census region
df <- df %>%
  mutate(
    AllegationsPer1000Schools = (Total / NumSchools) * 1000,
    Region = case_when(
      State %in% c("Connecticut","Maine","Massachusetts","New Hampshire",
                   "Rhode Island","Vermont","New Jersey","New York","Pennsylvania") ~ "Northeast",
      State %in% c("Illinois","Indiana","Iowa","Kansas","Michigan","Minnesota",
                   "Missouri","Nebraska","North Dakota","Ohio","South Dakota","Wisconsin") ~ "Midwest",
      State %in% c("Alaska","Arizona","California","Colorado","Hawaii","Idaho",
                   "Montana","Nevada","New Mexico","Oregon","Utah","Washington","Wyoming") ~ "West",
      TRUE ~ "South"
    )
  )

cat("Rows:", nrow(df), "| Columns:", ncol(df), "| Missing values:", sum(is.na(df)), "\n")

## Rows: 51 | Columns: 11 | Missing values: 0

Descriptive Statistics

summary_table <- df %>%
  summarise(
    Min    = min(Total),
    Q1     = quantile(Total, 0.25),
    Median = median(Total),
    Mean   = round(mean(Total), 1),
    Q3     = quantile(Total, 0.75),
    Max    = max(Total),
    SD     = round(sd(Total), 1),
    n      = n()
  )

knitr::kable(summary_table,
             caption = "Summary Statistics: Total Allegations by State (excl. national total)")

Summary Statistics: Total Allegations by State (excl. national total)
Min	Q1	Median	Mean	Q3	Max	SD	n
58	854.5	1295	2650.8	2781	19687	3856.1	51

The distribution of total allegations is highly right-skewed meaninf a small number of large states (eg. Illinois, California) drive very high totals.
The median (1,303) is considerably lower than the mean (4,913) which confirms the skewness furthermore.
Normalising by number of schools (per 1,000) is therefore essential for fair state comparisons.

Descriptive Statistics: Allegation Types

# National totals by type
national <- raw %>%
  filter(State == "United States") %>%
  mutate(across(c(Sex, Race, Disability, SexualOrientation, Religion), as.numeric)) %>%
  select(Sex, Race, Disability, SexualOrientation, Religion) %>%
  pivot_longer(everything(), names_to = "Type", values_to = "Count")

national$Type <- factor(national$Type,
  levels = c("Sex","Race","Disability","SexualOrientation","Religion"),
  labels = c("Sex","Race / Colour","Disability","Sexual Orientation","Religion"))

ggplot(national, aes(x = reorder(Type, -Count), y = Count, fill = Type)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::comma(Count)), vjust = -0.4, size = 3.5) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Total U.S. Allegations by Type (2015–16)",
       x = "Allegation Type", y = "Number of Allegations") +
  theme_minimal(base_size = 13)

Sex-based allegations are the most common (55,579; ~41% of all allegations)
Religion-based allegations are the least reported (11,212; ~8%)

Descriptive Statistics: State-Level Rates

top_states <- df %>%
  arrange(desc(AllegationsPer1000Schools)) %>%
  slice_head(n = 15)

ggplot(top_states, aes(x = reorder(State, AllegationsPer1000Schools),
                        y = AllegationsPer1000Schools, fill = Region)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top 15 States: Allegations per 1,000 Schools",
       x = NULL, y = "Allegations per 1,000 Schools",
       fill = "Region") +
  theme_minimal(base_size = 12)

Rhode island and illinois have exceptionally high rates per 1,000 schools
Northeast states dominate the top 15, suggesting regional reporting differences.

Descriptive Statistics: Regional Overview

ggplot(df, aes(x = Region, y = AllegationsPer1000Schools, fill = Region)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
  geom_jitter(width = 0.1, alpha = 0.4, size = 1.5) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Allegations per 1,000 Schools by U.S. Region",
       x = "Region", y = "Allegations per 1,000 Schools") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

The northeast has the highest median rate and widest spread.
South has consistently lower rates with less variability.
Several outliers are visible in all regions, particularly the northeast.

Hypothesis Testing

Do northeast states report significantly more harassment allegations per 1,000 schools than southern states?

The descriptive analysis showed northeast had the highest median rate and that south had the lowest which makes this a natural comparison.
A Welch’s two-sample t-test is used, as it does not assume equal variances between groups.

Hypotheses:

\[H_0: \mu_{NE} = \mu_{S} \quad \text{(no difference in mean rates)}\]

\[H_A: \mu_{NE} > \mu_{S} \quad \text{(Northeast has a higher mean rate)}\]

Significance level: α = 0.05 (one-tailed)

Assumptions:

Independence of observations (each state wouuld be a separate unit).
Approximately normal distribution within groups.
Homogeneity of variances.

Hypothesis Testing: Assumption Checks

ne_rates <- df %>% filter(Region == "Northeast") %>% pull(AllegationsPer1000Schools)
s_rates  <- df %>% filter(Region == "South")     %>% pull(AllegationsPer1000Schools)

par(mfrow = c(1, 2))
qqnorm(ne_rates, main = "Q-Q Plot: Northeast"); qqline(ne_rates, col = "steelblue")
qqnorm(s_rates,  main = "Q-Q Plot: South");     qqline(s_rates,  col = "tomato")

par(mfrow = c(1, 1))

Both groups have an approximately normal pattern on the Q-Q plots.
Northeast: n = 9, South: n = 17. (Adequate sample sizes)
Welch’s test is preferred over other t-test here because the group variances differ considerably (Northeast SD ≈ 1,832 vs South SD ≈ 682).

Hypothesis Testing: Welch’s t-test Results

t.test(ne_rates, s_rates, alternative = "greater", var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  ne_rates and s_rates
## t = 2.1704, df = 9.191, p-value = 0.02873
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  216.1404      Inf
## sample estimates:
## mean of x mean of y 
## 2246.2226  872.9464

df %>%
  filter(Region %in% c("Northeast", "South")) %>%
  ggplot(aes(x = Region, y = AllegationsPer1000Schools, fill = Region)) +
  geom_boxplot(width = 0.4, outlier.colour = "red") +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  scale_fill_manual(values = c("Northeast" = "steelblue", "South" = "tomato")) +
  labs(title = "Allegations per 1,000 Schools: Northeast vs South",
       x = NULL, y = "Allegations per 1,000 Schools") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation:

The Welch’s t-test returned p < 0.05, so we reject H₀.
There is statistically significant evidence that northeast states report more allegations per 1,000 schools than southern states.
This may reflect stronger reporting cultures, and/or genuine differences in school climate

Regression Analysis: Can School Count Predict Allegations?

Research Question: Does the number of schools in a state predict the total number of harassment and bullying allegations?

Hypotheses:

\[H_0: \beta_1 = 0 \quad \text{(number of schools has no linear effect on allegations)}\]

\[H_A: \beta_1 \neq 0\]

Model:

\[\text{Total Allegations}_i = \beta_0 + \beta_1 \times \text{NumSchools}_i + \varepsilon_i\]

Assumptions: Linearity, independence, normality of residuals

Regression Analysis: Scatterplot

ggplot(df, aes(x = NumSchools, y = Total, colour = Region, label = State)) +
  geom_point(size = 2.5, alpha = 0.8) +
  geom_smooth(method = "lm", colour = "black", se = TRUE, linetype = "dashed") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Total Allegations vs. Number of Schools by State",
       x = "Number of Schools", y = "Total Allegations",
       colour = "Region") +
  theme_minimal(base_size = 12)

A positive linear trend is visible (states with more schools tend to report more allegations).
Variability around the trend line suggests other factors also play a role.

Regression Analysis: Model Results

lm_model <- lm(Total ~ NumSchools, data = df)
summary(lm_model)

## 
## Call:
## lm(formula = Total ~ NumSchools, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8440.6  -705.5  -183.9   186.6 13869.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -79.6123   550.4040  -0.145    0.886    
## NumSchools    1.4451     0.2074   6.969 7.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2761 on 49 degrees of freedom
## Multiple R-squared:  0.4978, Adjusted R-squared:  0.4875 
## F-statistic: 48.56 on 1 and 49 DF,  p-value: 7.421e-09

Key values:

Slope (β₁):For each additional school, total allegations increase by approximately 1.45 on average.
R²:Number of schools explains approximately 49.8% of the variance in total allegations.
p-value for slope:< 0.001, so we reject H₀.

Regression: 95% Confidence Interval for Slope

confint(lm_model, level = 0.95)

##                    2.5 %      97.5 %
## (Intercept) -1185.690503 1026.465954
## NumSchools      1.028384    1.861865

Interpretation:

We are 95% confident that β₁ lies between approximately 0.97 and 1.92
Since this interval does not include zero, we confirm that the number of schools is a significant positive predictor of total allegations.
A 95% CI for the intercept that includes zero suggests the intercept is not significantly different from zero.

Discussion

Summary of Findings:

Sex-based harassment is the most commonly reported type nationally, comprising over 41% of all allegations.
After normalising by school count, northeats reports significantly higher rates than other regions, particularly south.
A Welch’s two-sample t-test confirmed this difference is statistically significant (p < 0.05).
Number of schools is a significant positive predictor of total allegations (β₁ ≈ 1.45, p < 0.001), though it explains only ~50% of the variance

Strengths:

Census-level data minimises sampling bias
Normalising by school count allows fair cross-state comparison

Limitations:

This is cross-sectional data and therefor causal conclusions cannot be drawn.
The District of Columbia had >15% missing data; results should be interpreted with caution

Future Directions:

Disaggregate by school level (elementary vs. secondary).
Incorporate socioeconomic and demographic state-level covariates as regression predictors.

Take home message: - Evidently harassment and bullying allegations in U.S. schools are not uniformly distributed. Where you go to school matters, and regional policy and reporting culture plays a significant role.

References

U.S. Department of Education, Office for Civil Rights. (2016). Civil Rights Data Collection: 2015–16. Retrieved from http://ocrdata.ed.gov
U.S. Census Bureau. (2023). Census regions and divisions of the United States. Retrieved from https://www.census.gov/geographies/reference-maps/2010/geo/2010-census-regions-and-divisions-of-the-united-states.html
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). SAGE Publications.

School Harassment & Bullying Allegations in the U.S.

A Statistical Analysis of Civil Rights Data Collection 2015–16