School Harassment & Bullying Allegations in the U.S.

A Statistical Analysis of Civil Rights Data Collection 2015–16

B. Sachini S. Perera | s4217237

Last updated:30.05.2026

Introduction

Problem Statement

Research Questions:

  1. Descriptive: How are harassment and bullying allegations distributed across U.S. states and allegation types in 2015–16?

  2. Hypothesis Testing: do allegations per 1,000 schools differ significantly across U.S. geographi regions (Northeast, south, midwest, west)?

  3. Regression Analysis: Can the number of schools in a state predict the total number of harassment allegations reported?

Rationale:

Data

Source:

U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection, 2015–16. http://ocrdata.ed.gov

Sampling:

Data: Variables

Variable Type Description
State Categorical U.S. state or territory
Total Numeric Total allegations reported
Sex Categorical Allegations based on sex
Race Numeric Allegations based on race/colour/national origin
Disability Numeric Allegations based on disability
SexualOrientation Numeric Allegations based on sexual orientation
Religion Numeric Allegations based on religion
NumSchools Numeric Number of schools in state
PctSchools Numeric % of schools reporting

Derived variables created:

Data: Pre-processing

library(readxl)

# Load the data, skip top metadata rows
raw <- read_excel(
  "Allegations-of-Harassment-or-Bullying.xlsx",
  skip = 4,
  col_names = c("drop", "State", "Total", "Sex", "Race",
                "Disability", "SexualOrientation", "Religion",
                "NumSchools", "PctSchools", "Flag")
)

# Keep only the 51 state rows; convert numeric columns
df <- raw %>%
  select(-drop, -Flag) %>%
  filter(!is.na(Total), State != "United States",
         !grepl("^NOTE|^Data|^SOURCE", State)) %>%
  mutate(across(-State, as.numeric))

# Add rate per 1,000 schools and U.S. Census region
df <- df %>%
  mutate(
    AllegationsPer1000Schools = (Total / NumSchools) * 1000,
    Region = case_when(
      State %in% c("Connecticut","Maine","Massachusetts","New Hampshire",
                   "Rhode Island","Vermont","New Jersey","New York","Pennsylvania") ~ "Northeast",
      State %in% c("Illinois","Indiana","Iowa","Kansas","Michigan","Minnesota",
                   "Missouri","Nebraska","North Dakota","Ohio","South Dakota","Wisconsin") ~ "Midwest",
      State %in% c("Alaska","Arizona","California","Colorado","Hawaii","Idaho",
                   "Montana","Nevada","New Mexico","Oregon","Utah","Washington","Wyoming") ~ "West",
      TRUE ~ "South"
    )
  )

cat("Rows:", nrow(df), "| Columns:", ncol(df), "| Missing values:", sum(is.na(df)), "\n")
## Rows: 51 | Columns: 11 | Missing values: 0

Descriptive Statistics

summary_table <- df %>%
  summarise(
    Min    = min(Total),
    Q1     = quantile(Total, 0.25),
    Median = median(Total),
    Mean   = round(mean(Total), 1),
    Q3     = quantile(Total, 0.75),
    Max    = max(Total),
    SD     = round(sd(Total), 1),
    n      = n()
  )

knitr::kable(summary_table,
             caption = "Summary Statistics: Total Allegations by State (excl. national total)")
Summary Statistics: Total Allegations by State (excl. national total)
Min Q1 Median Mean Q3 Max SD n
58 854.5 1295 2650.8 2781 19687 3856.1 51

Descriptive Statistics: Allegation Types

# National totals by type
national <- raw %>%
  filter(State == "United States") %>%
  mutate(across(c(Sex, Race, Disability, SexualOrientation, Religion), as.numeric)) %>%
  select(Sex, Race, Disability, SexualOrientation, Religion) %>%
  pivot_longer(everything(), names_to = "Type", values_to = "Count")

national$Type <- factor(national$Type,
  levels = c("Sex","Race","Disability","SexualOrientation","Religion"),
  labels = c("Sex","Race / Colour","Disability","Sexual Orientation","Religion"))

ggplot(national, aes(x = reorder(Type, -Count), y = Count, fill = Type)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::comma(Count)), vjust = -0.4, size = 3.5) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Total U.S. Allegations by Type (2015–16)",
       x = "Allegation Type", y = "Number of Allegations") +
  theme_minimal(base_size = 13)

Descriptive Statistics: State-Level Rates

top_states <- df %>%
  arrange(desc(AllegationsPer1000Schools)) %>%
  slice_head(n = 15)

ggplot(top_states, aes(x = reorder(State, AllegationsPer1000Schools),
                        y = AllegationsPer1000Schools, fill = Region)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top 15 States: Allegations per 1,000 Schools",
       x = NULL, y = "Allegations per 1,000 Schools",
       fill = "Region") +
  theme_minimal(base_size = 12)

Descriptive Statistics: Regional Overview

ggplot(df, aes(x = Region, y = AllegationsPer1000Schools, fill = Region)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
  geom_jitter(width = 0.1, alpha = 0.4, size = 1.5) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Allegations per 1,000 Schools by U.S. Region",
       x = "Region", y = "Allegations per 1,000 Schools") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Hypothesis Testing

Do northeast states report significantly more harassment allegations per 1,000 schools than southern states?

Hypotheses:

\[H_0: \mu_{NE} = \mu_{S} \quad \text{(no difference in mean rates)}\]

\[H_A: \mu_{NE} > \mu_{S} \quad \text{(Northeast has a higher mean rate)}\]

Significance level: α = 0.05 (one-tailed)

Assumptions:

  1. Independence of observations (each state wouuld be a separate unit).
  2. Approximately normal distribution within groups.
  3. Homogeneity of variances.

Hypothesis Testing: Assumption Checks

ne_rates <- df %>% filter(Region == "Northeast") %>% pull(AllegationsPer1000Schools)
s_rates  <- df %>% filter(Region == "South")     %>% pull(AllegationsPer1000Schools)

par(mfrow = c(1, 2))
qqnorm(ne_rates, main = "Q-Q Plot: Northeast"); qqline(ne_rates, col = "steelblue")
qqnorm(s_rates,  main = "Q-Q Plot: South");     qqline(s_rates,  col = "tomato")

par(mfrow = c(1, 1))

Hypothesis Testing: Welch’s t-test Results

t.test(ne_rates, s_rates, alternative = "greater", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  ne_rates and s_rates
## t = 2.1704, df = 9.191, p-value = 0.02873
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  216.1404      Inf
## sample estimates:
## mean of x mean of y 
## 2246.2226  872.9464
df %>%
  filter(Region %in% c("Northeast", "South")) %>%
  ggplot(aes(x = Region, y = AllegationsPer1000Schools, fill = Region)) +
  geom_boxplot(width = 0.4, outlier.colour = "red") +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  scale_fill_manual(values = c("Northeast" = "steelblue", "South" = "tomato")) +
  labs(title = "Allegations per 1,000 Schools: Northeast vs South",
       x = NULL, y = "Allegations per 1,000 Schools") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation:

Regression Analysis: Can School Count Predict Allegations?

Research Question: Does the number of schools in a state predict the total number of harassment and bullying allegations?

Hypotheses:

\[H_0: \beta_1 = 0 \quad \text{(number of schools has no linear effect on allegations)}\]

\[H_A: \beta_1 \neq 0\]

Model:

\[\text{Total Allegations}_i = \beta_0 + \beta_1 \times \text{NumSchools}_i + \varepsilon_i\]

Assumptions: Linearity, independence, normality of residuals

Regression Analysis: Scatterplot

ggplot(df, aes(x = NumSchools, y = Total, colour = Region, label = State)) +
  geom_point(size = 2.5, alpha = 0.8) +
  geom_smooth(method = "lm", colour = "black", se = TRUE, linetype = "dashed") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Total Allegations vs. Number of Schools by State",
       x = "Number of Schools", y = "Total Allegations",
       colour = "Region") +
  theme_minimal(base_size = 12)

Regression Analysis: Model Results

lm_model <- lm(Total ~ NumSchools, data = df)
summary(lm_model)
## 
## Call:
## lm(formula = Total ~ NumSchools, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8440.6  -705.5  -183.9   186.6 13869.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -79.6123   550.4040  -0.145    0.886    
## NumSchools    1.4451     0.2074   6.969 7.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2761 on 49 degrees of freedom
## Multiple R-squared:  0.4978, Adjusted R-squared:  0.4875 
## F-statistic: 48.56 on 1 and 49 DF,  p-value: 7.421e-09

Key values:

Regression: 95% Confidence Interval for Slope

confint(lm_model, level = 0.95)
##                    2.5 %      97.5 %
## (Intercept) -1185.690503 1026.465954
## NumSchools      1.028384    1.861865

Interpretation:

Discussion

Summary of Findings:

Strengths:

Limitations:

Future Directions:

Take home message: - Evidently harassment and bullying allegations in U.S. schools are not uniformly distributed. Where you go to school matters, and regional policy and reporting culture plays a significant role.

References