B. Sachini S. Perera | s4217237
Last updated:30.05.2026
Research Questions:
Descriptive: How are harassment and bullying allegations distributed across U.S. states and allegation types in 2015–16?
Hypothesis Testing: do allegations per 1,000 schools differ significantly across U.S. geographi regions (Northeast, south, midwest, west)?
Regression Analysis: Can the number of schools in a state predict the total number of harassment allegations reported?
Rationale:
Source:
U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection, 2015–16. http://ocrdata.ed.gov
Sampling:
| Variable | Type | Description |
|---|---|---|
| State | Categorical | U.S. state or territory |
| Total | Numeric | Total allegations reported |
| Sex | Categorical | Allegations based on sex |
| Race | Numeric | Allegations based on race/colour/national origin |
| Disability | Numeric | Allegations based on disability |
| SexualOrientation | Numeric | Allegations based on sexual orientation |
| Religion | Numeric | Allegations based on religion |
| NumSchools | Numeric | Number of schools in state |
| PctSchools | Numeric | % of schools reporting |
Derived variables created:
library(readxl)
# Load the data, skip top metadata rows
raw <- read_excel(
"Allegations-of-Harassment-or-Bullying.xlsx",
skip = 4,
col_names = c("drop", "State", "Total", "Sex", "Race",
"Disability", "SexualOrientation", "Religion",
"NumSchools", "PctSchools", "Flag")
)
# Keep only the 51 state rows; convert numeric columns
df <- raw %>%
select(-drop, -Flag) %>%
filter(!is.na(Total), State != "United States",
!grepl("^NOTE|^Data|^SOURCE", State)) %>%
mutate(across(-State, as.numeric))
# Add rate per 1,000 schools and U.S. Census region
df <- df %>%
mutate(
AllegationsPer1000Schools = (Total / NumSchools) * 1000,
Region = case_when(
State %in% c("Connecticut","Maine","Massachusetts","New Hampshire",
"Rhode Island","Vermont","New Jersey","New York","Pennsylvania") ~ "Northeast",
State %in% c("Illinois","Indiana","Iowa","Kansas","Michigan","Minnesota",
"Missouri","Nebraska","North Dakota","Ohio","South Dakota","Wisconsin") ~ "Midwest",
State %in% c("Alaska","Arizona","California","Colorado","Hawaii","Idaho",
"Montana","Nevada","New Mexico","Oregon","Utah","Washington","Wyoming") ~ "West",
TRUE ~ "South"
)
)
cat("Rows:", nrow(df), "| Columns:", ncol(df), "| Missing values:", sum(is.na(df)), "\n")## Rows: 51 | Columns: 11 | Missing values: 0
summary_table <- df %>%
summarise(
Min = min(Total),
Q1 = quantile(Total, 0.25),
Median = median(Total),
Mean = round(mean(Total), 1),
Q3 = quantile(Total, 0.75),
Max = max(Total),
SD = round(sd(Total), 1),
n = n()
)
knitr::kable(summary_table,
caption = "Summary Statistics: Total Allegations by State (excl. national total)")| Min | Q1 | Median | Mean | Q3 | Max | SD | n |
|---|---|---|---|---|---|---|---|
| 58 | 854.5 | 1295 | 2650.8 | 2781 | 19687 | 3856.1 | 51 |
# National totals by type
national <- raw %>%
filter(State == "United States") %>%
mutate(across(c(Sex, Race, Disability, SexualOrientation, Religion), as.numeric)) %>%
select(Sex, Race, Disability, SexualOrientation, Religion) %>%
pivot_longer(everything(), names_to = "Type", values_to = "Count")
national$Type <- factor(national$Type,
levels = c("Sex","Race","Disability","SexualOrientation","Religion"),
labels = c("Sex","Race / Colour","Disability","Sexual Orientation","Religion"))
ggplot(national, aes(x = reorder(Type, -Count), y = Count, fill = Type)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = scales::comma(Count)), vjust = -0.4, size = 3.5) +
scale_y_continuous(labels = scales::comma) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Total U.S. Allegations by Type (2015–16)",
x = "Allegation Type", y = "Number of Allegations") +
theme_minimal(base_size = 13)top_states <- df %>%
arrange(desc(AllegationsPer1000Schools)) %>%
slice_head(n = 15)
ggplot(top_states, aes(x = reorder(State, AllegationsPer1000Schools),
y = AllegationsPer1000Schools, fill = Region)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
labs(title = "Top 15 States: Allegations per 1,000 Schools",
x = NULL, y = "Allegations per 1,000 Schools",
fill = "Region") +
theme_minimal(base_size = 12)ggplot(df, aes(x = Region, y = AllegationsPer1000Schools, fill = Region)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
geom_jitter(width = 0.1, alpha = 0.4, size = 1.5) +
scale_fill_brewer(palette = "Pastel1") +
labs(title = "Allegations per 1,000 Schools by U.S. Region",
x = "Region", y = "Allegations per 1,000 Schools") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Do northeast states report significantly more harassment allegations per 1,000 schools than southern states?
Hypotheses:
\[H_0: \mu_{NE} = \mu_{S} \quad \text{(no difference in mean rates)}\]
\[H_A: \mu_{NE} > \mu_{S} \quad \text{(Northeast has a higher mean rate)}\]
Significance level: α = 0.05 (one-tailed)
Assumptions:
ne_rates <- df %>% filter(Region == "Northeast") %>% pull(AllegationsPer1000Schools)
s_rates <- df %>% filter(Region == "South") %>% pull(AllegationsPer1000Schools)
par(mfrow = c(1, 2))
qqnorm(ne_rates, main = "Q-Q Plot: Northeast"); qqline(ne_rates, col = "steelblue")
qqnorm(s_rates, main = "Q-Q Plot: South"); qqline(s_rates, col = "tomato")##
## Welch Two Sample t-test
##
## data: ne_rates and s_rates
## t = 2.1704, df = 9.191, p-value = 0.02873
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 216.1404 Inf
## sample estimates:
## mean of x mean of y
## 2246.2226 872.9464
df %>%
filter(Region %in% c("Northeast", "South")) %>%
ggplot(aes(x = Region, y = AllegationsPer1000Schools, fill = Region)) +
geom_boxplot(width = 0.4, outlier.colour = "red") +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
scale_fill_manual(values = c("Northeast" = "steelblue", "South" = "tomato")) +
labs(title = "Allegations per 1,000 Schools: Northeast vs South",
x = NULL, y = "Allegations per 1,000 Schools") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Interpretation:
Research Question: Does the number of schools in a state predict the total number of harassment and bullying allegations?
Hypotheses:
\[H_0: \beta_1 = 0 \quad \text{(number of schools has no linear effect on allegations)}\]
\[H_A: \beta_1 \neq 0\]
Model:
\[\text{Total Allegations}_i = \beta_0 + \beta_1 \times \text{NumSchools}_i + \varepsilon_i\]
Assumptions: Linearity, independence, normality of residuals
ggplot(df, aes(x = NumSchools, y = Total, colour = Region, label = State)) +
geom_point(size = 2.5, alpha = 0.8) +
geom_smooth(method = "lm", colour = "black", se = TRUE, linetype = "dashed") +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma) +
scale_colour_brewer(palette = "Set1") +
labs(title = "Total Allegations vs. Number of Schools by State",
x = "Number of Schools", y = "Total Allegations",
colour = "Region") +
theme_minimal(base_size = 12)##
## Call:
## lm(formula = Total ~ NumSchools, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8440.6 -705.5 -183.9 186.6 13869.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -79.6123 550.4040 -0.145 0.886
## NumSchools 1.4451 0.2074 6.969 7.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2761 on 49 degrees of freedom
## Multiple R-squared: 0.4978, Adjusted R-squared: 0.4875
## F-statistic: 48.56 on 1 and 49 DF, p-value: 7.421e-09
Key values:
## 2.5 % 97.5 %
## (Intercept) -1185.690503 1026.465954
## NumSchools 1.028384 1.861865
Interpretation:
Summary of Findings:
Strengths:
Limitations:
Future Directions:
Take home message: - Evidently harassment and bullying allegations in U.S. schools are not uniformly distributed. Where you go to school matters, and regional policy and reporting culture plays a significant role.
U.S. Department of Education, Office for Civil Rights. (2016). Civil Rights Data Collection: 2015–16. Retrieved from http://ocrdata.ed.gov
U.S. Census Bureau. (2023). Census regions and divisions of the United States. Retrieved from https://www.census.gov/geographies/reference-maps/2010/geo/2010-census-regions-and-divisions-of-the-united-states.html
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). SAGE Publications.