Introduction

This project examines school-related offenses across U.S. states during the 2017–2018 school year. The dataset includes different types of incidents, ranging from physical attacks to threats and firearm-related offenses, along with the number of schools in each state.

At first glance, it might seem like the goal is simply to identify which states have the most incidents. However, raw totals do not tell the full story. Larger states naturally report more offenses because they have more schools. That does not necessarily mean they are less safe.

Because of this, the analysis focuses on both total counts and normalized measures such as offenses per school. This allows for a more meaningful comparison across states and helps highlight where incidents are truly more concentrated.

Description of Project

A preliminary look at the dataset shows:

Offense Types: Incidents range from non-weapon physical attacks to firearm-related offenses.

State Variation: There is substantial variation in total incidents across states.

Scale Effects: Larger states tend to report more incidents, making normalization necessary.

Key Focus Areas:

  • Most common types of school offenses
  • Differences across states
  • Offenses relative to number of schools
  • Identification of outliers

Data Preparation

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('seaborn-v0_8-whitegrid')

plt.rcParams.update({
    "font.size": 11,
    "axes.titlesize": 16,
    "axes.labelsize": 12,
    "xtick.labelsize": 10,
    "ytick.labelsize": 10
})

df = pd.read_excel(
    r"C:\Users\jason\Desktop\ST 472\Zelterman Files Supp 2\Offenses.xlsx",
    header=2
)

df.columns = [
    "Ignore", "State",
    "Rape","Sexual Assault",
    "Robbery (Weapon)","Robbery (Firearm/Explosive)","Robbery (No Weapon)",
    "Attack (Weapon)","Attack (Firearm/Explosive)","Attack (No Weapon)",
    "Threat (Weapon)","Threat (Firearm/Explosive)","Threat (No Weapon)",
    "Firearm Possession",
    "Number of Schools","Percent Reporting"
]

df = df[df["State"].notna()]
df = df[df["State"] != "State"]
df = df[df["State"] != "50 states, District of Columbia, and Puerto Rico"]
df = df.reset_index(drop=True)

offense_cols = df.columns[2:-2]

df[offense_cols] = df[offense_cols].apply(pd.to_numeric, errors='coerce')
df["Number of Schools"] = pd.to_numeric(df["Number of Schools"], errors='coerce')

df = df.dropna(subset=offense_cols.tolist() + ["Number of Schools"])
df = df[df["Number of Schools"] > 0]

df["Total Offenses"] = df[offense_cols].sum(axis=1)
df["Offenses per School"] = df["Total Offenses"] / df["Number of Schools"]

Visualization 1: Total School Offenses by Type

The first visualization shows the total number of incidents for each offense type, and one pattern stands out immediately. Physical attacks without a weapon dominate the dataset by a wide margin, with over 800,000 reported incidents. The next closest category, threats without a weapon, is far behind at around 200,000.

After those two categories, there is a sharp drop. Most other offense types fall into much smaller ranges, especially those involving firearms or explosives. This creates a clear imbalance in the data where everyday conflicts make up the majority of reported incidents.

This suggests that school safety issues are less about rare extreme events and more about frequent, lower-level conflicts. It shifts the focus toward understanding behavior, interactions, and environment within schools rather than only focusing on severe but less common incidents.

totals = df[offense_cols].sum().sort_values(ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(totals.index, totals.values)

plt.title("Total School Offenses by Type", pad=15)
plt.xlabel("Number of Incidents")

plt.xlim(0, max(totals.values) * 1.1);

for i, v in enumerate(totals.values):
    plt.text(v, i, f"{int(v):,}", va='center', fontsize=9)

plt.tight_layout()
plt.show()


Visualization 2: School Offenses by State (Top 15)

The heatmap highlights the top 15 states based on total offenses and shows how those offenses are distributed across categories.

California clearly stands out across almost every category, which is expected given its size. However, other states like Texas, Georgia, and Illinois also show consistently higher levels across multiple offense types. This suggests that in these states, incidents are not isolated to one category but are spread across different types of behavior.

At the same time, some variation is visible. For example, certain states show relatively stronger intensity in specific categories like threats or robberies, even if their totals are not as high overall. This indicates that the nature of incidents can differ across states, not just the volume.

heat_df = df.set_index("State")[offense_cols]

top_states = heat_df.sum(axis=1).sort_values(ascending=False).head(15).index
heat_df_top = heat_df.loc[top_states]

heat_norm = heat_df_top / heat_df_top.max()

plt.figure(figsize=(12, 8))
plt.imshow(heat_norm, aspect='auto', cmap='viridis')

cbar = plt.colorbar()
cbar.set_label("Relative Intensity", rotation=270, labelpad=15);

plt.title("School Offenses by State (Top 15)", pad=15)

plt.xticks(range(len(offense_cols)), offense_cols, rotation=45, ha='right');
plt.yticks(range(len(heat_df_top.index)), heat_df_top.index);

plt.tight_layout()
plt.show()


Visualization 3: Offenses per School by State

Once the data is adjusted for the number of schools, the ranking changes in a meaningful way.

Louisiana and Nevada rise to the top, with significantly higher offenses per school than most other states. Georgia and the District of Columbia also appear near the top, followed by Maryland. These states were not all the largest in terms of total incidents, but they show a higher concentration of incidents within individual schools.

The gap between the top few states and the rest is noticeable. Louisiana, for example, stands clearly above the others, suggesting that incidents occur much more frequently at the school level there.

This visualization provides a more balanced comparison. It shows that some states experience more intense levels of incidents even if their total counts are lower. It shifts the focus from size to concentration, which is more useful when thinking about school environments.

top_rate = df.sort_values("Offenses per School", ascending=False).head(15)

plt.figure(figsize=(10, 6))
plt.barh(top_rate["State"], top_rate["Offenses per School"])

plt.title("Top 15 States by Offenses per School", pad=15)
plt.xlabel("Offenses per School")

plt.gca().invert_yaxis()

for i, v in enumerate(top_rate["Offenses per School"]):
    plt.text(v, i, f"{v:.1f}", va='center', fontsize=9)

plt.tight_layout()
plt.show()


Visualization 4: Total Offenses vs Number of Schools

The scatter plot shows a strong positive relationship between the number of schools and total offenses. This confirms what we would expect, which is that more schools generally lead to more reported incidents.

However, the more important insight comes from how individual states compare to the trendline. States like Louisiana, Georgia, Nevada, Maryland, and the District of Columbia sit above the line, meaning they report more incidents than expected given their number of schools.

These same states were identified earlier as having high offenses per school. Seeing them again here reinforces the idea that their higher incident levels are not just due to size. They are consistently above average across different ways of measuring the data.

This consistency across visualizations strengthens the conclusion that certain states have a higher concentration of school-related incidents, not just higher totals.

z = np.polyfit(df["Number of Schools"], df["Total Offenses"], 1)
p = np.poly1d(z)

top5_states = df.sort_values("Offenses per School", ascending=False).head(5)["State"]

colors = ["red" if s in top5_states.values else "gray" for s in df["State"]]

plt.figure(figsize=(8, 6))
plt.scatter(df["Number of Schools"], df["Total Offenses"], c=colors)

plt.plot(df["Number of Schools"], p(df["Number of Schools"]))

for i, row in df.iterrows():
    if row["State"] in top5_states.values:
        plt.text(
            row["Number of Schools"] + 80,
            row["Total Offenses"] + 1500,
            row["State"],
            fontsize=8,
            weight='bold',
            bbox=dict(facecolor='white', alpha=0.7, edgecolor='none')
        )

plt.title("Total Offenses vs Number of Schools", pad=15)
plt.xlabel("Number of Schools")
plt.ylabel("Total Offenses")

plt.tight_layout()
plt.show()


Visualization 5: Offenses per School by State (Ranked)

The ranked dot plot provides a full view of all states, ordered by offenses per school.

Most states cluster in the middle, roughly between 5 and 15 offenses per school. This suggests there is a common range where the majority of states fall. However, the top few states stand out clearly above this cluster.

Louisiana and Nevada are far ahead of the rest, followed by Georgia and the District of Columbia. There is a visible gap between these states and the majority, which makes the difference feel more significant.

At the lower end, states like Puerto Rico, North Dakota, and Hawaii have much lower values, indicating fewer incidents per school. This spread across the full ranking shows that school safety conditions are not evenly distributed.

Seeing all states together in one plot makes it easier to understand both the typical range and the outliers, which is harder to see in smaller subsets.

sorted_df = df.sort_values("Offenses per School")

colors = ["red" if s in top5_states.values else "gray" for s in sorted_df["State"]]

plt.figure(figsize=(9, 12))
plt.scatter(sorted_df["Offenses per School"], sorted_df["State"], c=colors)

plt.title("Offenses per School by State (Ranked)", pad=15)
plt.xlabel("Offenses per School")

plt.yticks(fontsize=8);

plt.tight_layout()
plt.show()


Conclusion

Across all visualizations, a consistent pattern emerges. While total incidents are highest in large states, adjusting for the number of schools reveals a different story.

A small group of states, including Louisiana, Nevada, Georgia, and the District of Columbia, repeatedly appear as having higher levels of incidents relative to their size. These states are not just outliers in one visualization, but across multiple perspectives. At the same time, most states fall within a moderate range, suggesting that there is a general baseline level of school-related incidents. The differences become most meaningful when looking at how far certain states rise above that baseline.

This matters because it changes how school safety should be understood and addressed. Focusing only on total incidents can hide where problems are more concentrated within individual schools. By shifting to a per-school perspective, it becomes clearer where students and staff may be experiencing these issues more frequently in their day-to-day environments.

This also has practical implications. It suggests that resources, policy decisions, and prevention efforts should be more targeted rather than evenly distributed. States that consistently appear above expected levels may benefit from more focused attention, while others may require different approaches.

More broadly, this analysis shows that how a problem is measured directly affects how it is interpreted. Looking beyond totals to consider concentration provides a clearer and more useful understanding of school-related incidents, which ultimately supports more informed decision making.