Project 1 - School Shooting Incidents

Author

A Porambo

Columbine Memorial Wall of Healing, Littleton, CO - Source: Library of Congress

Introductory Essay

The dataset I seek to explore lists and describes over 2,000 individual school shooting incidents in the United States from 1970 to 2022. This data was collected by the CHDS School Shooting Safety Compendium at the Naval Postgraduate School Center for Homeland Defense and Security.

For the purpose of this database, school shootings are defined as any incident where “a gun is brandished, is fired, or a bullet hits school property for any reason, regardless of the number of victims (including zero), time, day of the week, or reason.” Variables include the following:

  • verified primary source citations;

  • a reliability score quantifying the dependability of the information (Reliability);

  • date and academic quarter during which the incident occurred;

  • the name of the school, the level of the school (School_Level), and the city and state where the school is located;

  • location of incident (Location) and type of location (Location_Type) in relation to the school campus;

  • time of the first shot (First_Shot) and the time period in which it occurred (Time_Period);

  • a brief summary and a more in-depth narrative of the incident;

  • what precipitated the shooting (Situation) and whether or not individuals were targeted or shot at random (Targets);

  • and various other variables, making up a total of 30 in the raw dataset.

With such a descriptive dataset recording these incidents over a period of more than 50 years, I hope to explore the nature of school shootings and see how they have changed over time. Were school shootings mostly unplanned in the 1970s and more often preplanned now, or vice versa? Do shootings occur most often in high schools, middle schools, or elementary schools? How many shots are typically fired per incident? I hope to gain a greater understanding of these violent incidents, which have become a regular occurrence in American life.

Load Library and Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # loads necessary library packages
library(treemap)
library(dplyr)
library(alluvial)
library(ggalluvial)
setwd("/Users/Owner/Documents/DATA 110/Week 3")
Sch_Sho_Incdt <- read_csv("Incident 2.csv") # loads dataset in .csv format 
Rows: 2069 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (27): Incident_ID, Sources, Number_News, Media_Attention, Quarter, Scho...
dbl   (2): Reliability, Shots_Fired
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean

Sch_Sho_Incdt_Cln <- Sch_Sho_Incdt[, -c(2:5)] # Removes columns for variables "Sources", "Number_News", "Media_Attention", and "Reliability."
Sch_Sho_Incdt_Cln2 <- Sch_Sho_Incdt_Cln[,-c(13:14)] # Removes columns for variables "Summary" and "Narrative."
Sch_Sho_Incdt_Cln3 <- Sch_Sho_Incdt_Cln2 |>
  drop_na(Quarter, School_Level, Location, Location_Type, During_School,
          Time_Period, First_Shot, Situation, Targets, Accomplice, Hostages,
          Barricade, Officer_Involved, Bullied, Domestic_Violence, Gang_Related,
          Preplanned, Shots_Fired, Active_Shooter_FBI)
  # Removes rows with NA values in any of the listed columns (drop_na() comes from tidyr, which is loaded with tidyverse).
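Removing columns by position depends on the column order of the .csv staying fixed. A minimal sketch of a name-based alternative is below. It runs on a made-up stand-in data frame (toy_raw is hypothetical and mimics only a few of the 30 columns), and it assumes the column names match the ones listed in the comments above.

```r
library(dplyr)

# Made-up stand-in for the raw data; values are placeholders, not real records.
toy_raw <- data.frame(
  Incident_ID  = c("A1", "A2"),
  Sources      = c("source 1", "source 2"),
  Reliability  = c(3, 4),
  School_Level = c("High", "Elementary"),
  Summary      = c("text", "text"),
  Narrative    = c("long text", "long text")
)

# Columns to drop, taken from the comments in the cleaning step.
drop_cols <- c("Sources", "Number_News", "Media_Attention",
               "Reliability", "Summary", "Narrative")

# any_of() ignores names that are not present instead of raising an error.
toy_clean <- toy_raw |> select(-any_of(drop_cols))

names(toy_clean)
```

Selecting by name keeps the step readable and means a reordered or newly added column would not silently change which variables get removed.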

Exploratory Visualizations

Bar Graph

ggplot(Sch_Sho_Incdt_Cln3, aes(x = School_Level)) + # Set School Level as the x-axis.
  geom_bar() + # Creates a simple bar graph showing the number of recorded school shootings between 1970 and 2022 by the level of the affiliated school.
  labs(x = "School Type", y = "Number of Incidents", title = "School Shootings by School Level, 1970 - 2022") # Revised the x- and y-axis labels and added a title.

Stacked Bar Plot

ggplot(Sch_Sho_Incdt_Cln3, aes(x = Situation, fill = Preplanned)) + # Set Situation as the x-axis and Preplanned as the color-coded fill.
  geom_bar() + # Created a bar plot.
  theme(axis.text.x = element_text(angle = 90)) + # Flipped the x-axis labels by 90 degrees for legibility.
  scale_fill_brewer(palette = "Set1") + # Made Set1 the palette for the fills.
  labs(y = "Number of Incidents", title = "Unplanned and Preplanned School Shootings by Situation, 1970 - 2022") # Revised the y-axis label and added a title.

Density Plot

ggplot(Sch_Sho_Incdt_Cln3, aes(x = Date, color = Preplanned, fill = Preplanned)) + # Set Date for the x-axis and Preplanned as the color and fill of the plot.
  geom_density(alpha = 0.5) + # Set transparency.
  scale_fill_brewer(palette = "Set2") + # Applied the Set2 palette for the fill of the density curves.
  scale_color_brewer(palette = "Set2") + # Applied the same palette to the outlines so they match the fills.
  labs(x = "Date of Recorded School Shooting", y = "Distribution of Incidents", title = "Unplanned and Preplanned School Shootings Over Time, 1970 - 2022", caption = "Source: The Naval Postgraduate School Center for Homeland Defense and Security") # Revised x- and y-axis labels and added a title and a caption.

Box Plot

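ggplot(Sch_Sho_Incdt_Cln3, aes(x = Preplanned, y = Shots_Fired, color = Preplanned)) + # Set Preplanned as the x-axis and outline color and Shots_Fired as the y-axis.
  geom_boxplot() + # Created a box plot.
  scale_color_manual(values = c("pink", "green", "purple")) + # Set the outline colors; the colors must be passed as one vector, and three colors assume Preplanned has at most three categories.
  labs(x = "Preplanned or Unplanned School Shooting", y = "Total Shots Fired per Incident", title = "Total Shots Fired per School Shooting, Unplanned and Preplanned, 1970 - 2022", caption = "Source: The Naval Postgraduate School Center for Homeland Defense and Security") # Revised x- and y-axis labels and added a title and a caption.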

Concluding Essay

After loading the dataset, I removed the columns for the variables “Sources”, “Number_News”, “Media_Attention”, “Reliability”, “Summary” and “Narrative.” These weren’t directly relevant to the type of analysis I wanted to do, so I decided to remove them and save RStudio the memory space. Next, I removed all “na” values from the dataset in order to make it more compatible with the forms of analysis I hoped to carry out. To do so, before importing the data into RStudio, I checked which variables included “null” values. I was then able to remove “na” values variable by variable after the data were imported into RStudio.

I began my analysis with a couple of exploratory graphs. The first was a simple bar graph showing the number of school shooting incidents between 1970 and 2022 by the level of the associated school. It is evident from this graph that, among the incidents recorded, most by far were associated with high schools. This makes sense to me: due to hormonal changes from their ongoing development, students in high school can be more aggressive than those in either middle school or elementary school. What did surprise me was that there were nearly double the number of school shootings recorded in elementary schools as in middle schools. I thought it would’ve been the other way around, because middle school often coincides with the start of puberty, when emotions begin to run high.

I followed this bar graph with a more complex stacked bar graph to view the number of school shootings divided by the type of situation. Each bar was subdivided further by whether or not a shooting was preplanned, indicated by segments of different colors. The most frequent situation precipitating a school shooting is “escalation of dispute.” This makes sense, as aggression and anger are often precursors to violence. It is strikingly evident that most school shootings, across all recorded situations, are not preplanned. The bars are almost completely blue, the color the key uses for shootings that weren’t preplanned. The situation with the highest count of planned school shootings, however, was indiscriminate shooting, which sounds both contradictory in theory and outright chilling. These may very well include the types of incidents that come to mind when you hear the phrase “school shooting” (a Columbine, Virginia Tech, or Sandy Hook), but this would require a more in-depth search through the data points to be certain.

Still looking at whether or not a school shooting was planned, I created a density plot of the relative distribution of planned and unplanned incidents over time. The relative number of all school shootings remained mostly static until a gradual rise began around the year 2000. This may or may not be evidence of the so-called “Columbine Effect,” in which mass shooters seemingly copied the methodologies of the perpetrators of that incident, which was one of the largest media events of its decade. The number of planned school shootings began to outpace those that were unplanned, until the middle of the 2010s, when unplanned incidents caught up with them. These continued to rise, but then outright exploded post-2020, when social, political, and racial tensions reached a boiling point amidst the COVID-19 lockdowns.
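Because geom_density() scales each group's curve to the same total area by default, the density plot shows when each category's incidents are concentrated rather than how many there were. A table of raw counts per year could complement it. The sketch below uses a made-up data frame (toy, with invented dates and labels) in place of the cleaned dataset, and assumes Preplanned is stored as text.

```r
# Made-up example rows; the dates and labels are invented for illustration.
toy <- data.frame(
  Date       = as.Date(c("1975-03-01", "1975-10-12", "2018-02-02",
                         "2018-05-20", "2021-11-08")),
  Preplanned = c("Yes", "No", "Yes", "Yes", "No")
)

# Extract the calendar year from each date.
toy$Year <- as.integer(format(toy$Date, "%Y"))

# Cross-tabulate incidents by year and by whether they were preplanned.
counts <- table(Year = toy$Year, Preplanned = toy$Preplanned)
counts
```

Comparing the columns of this table year by year would show directly whether planned incidents outnumbered unplanned ones in a given period.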

Wanting to get a measure of the degree of violence of these incidents, I graphed a final box plot of the number of shots fired per incident, separated by whether or not a shooting was preplanned. The resulting graph revealed that the majority of recorded total shots fired, about three-quarters of the values in each box plot, were fewer than 20 shots per incident. The interquartile ranges of the box plots were so thin that you can’t even discern the division between the second and third quartiles for two of the categories. The outliers on the plot of school shootings that were preplanned, however, are significantly larger and spread over a wider range than those for the other categories. It makes sense that perpetrators who arrived with the intention to commit these atrocities would prepare to use more bullets than those who didn’t plan their actions.
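The quartiles that the box plot draws can also be computed directly, which may help when the boxes are too thin to read. This is a minimal sketch on made-up numbers (toy is hypothetical); the 20-shot threshold is the one mentioned above.

```r
# Made-up shot counts for two categories of Preplanned.
toy <- data.frame(
  Preplanned  = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Shots_Fired = c(1, 2, 3, 10, 2, 5, 8, 60)
)

# First quartile, median, and third quartile of shots fired per category.
quartiles <- tapply(toy$Shots_Fired, toy$Preplanned,
                    quantile, probs = c(0.25, 0.5, 0.75))

# Share of incidents in each category with fewer than 20 shots fired.
share_under_20 <- tapply(toy$Shots_Fired < 20, toy$Preplanned, mean)

quartiles
share_under_20
```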

It is worth noting the number of “null” values found in many of the variables. Although I removed the “na”s, I left the “null”s (minus those in the “Shots_Fired” variable; for more, see below) because null is in and of itself a value. If I removed every observation that had a variable with a null value, I would be left with a population of fewer than 200 observations. This serves to underscore how, even though we are able to gather an enormous amount of data about school shootings, so much about these incidents is still unknown.
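One way to see how widespread the “null” entries are is to count them per column, alongside the true NA values. The sketch below assumes “null” is stored as a literal text string and uses a made-up data frame (toy, with invented values) in place of the real data.

```r
# Made-up rows; the category labels are placeholders.
toy <- data.frame(
  Situation   = c("Escalation of Dispute", "null", "Accidental"),
  Targets     = c("null", "null", "Random"),
  Shots_Fired = c(3, NA, 1)
)

# Number of literal "null" strings in each column.
null_counts <- sapply(toy, function(col) sum(col == "null", na.rm = TRUE))

# Number of true NA values in each column.
na_counts <- colSums(is.na(toy))

# Rows that would remain if every row with a "null" or an NA were dropped.
complete_rows <- sum(rowSums(toy == "null" | is.na(toy), na.rm = TRUE) == 0)

null_counts
na_counts
complete_rows
```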

In order to explore the number of shots fired per incident, I’d originally tried to create an alluvial plot of the changes in the number of shots fired per incident over time, with the alluvia separated by situation type. Unfortunately, R didn’t recognize many of the values listed under “Shots_Fired” as numerical. These included intervals (“10-30”), greater than/less than (“>10”), value plus (“20+”), “multiple”, “null” and blank spaces. Here’s how I addressed each of them:

Intervals (“30-50”): I used the median of the numbers in the interval. For example, I turned “30-50” into “40.”

Less Than (“<6”): I used the median of the numbers between the stated value and zero. For example, “<6” became “3.”

Greater Than/Plus (“>10”, “20+”): I erred on the conservative side and only added 1 to the stated number. For example, “>10” became “11” and “20+” became “21.”

“Null”: For the Shots_Fired variable only, I turned all of the “null” values into “0” for graphing purposes.

“Multiple”: After making the previous edits, I found the mean of all of the numerical values for “Shots_Fired”, 3.57, which I then rounded up to 4. I replaced every “multiple” data point in that column with “4.”
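The rules above could also be applied in R rather than by hand. The following is only a sketch of that idea, not the procedure actually used for this project: recode_shots() is a hypothetical helper, it assumes Shots_Fired is read in as text, and the default of 4 for “multiple” is the rounded mean reported above.

```r
# Hypothetical helper that applies the recoding rules to a character vector.
recode_shots <- function(x, multiple_value = 4) {
  x <- trimws(tolower(x))
  out <- rep(NA_real_, length(x)) # blanks and unrecognized entries stay NA

  # Plain whole numbers pass through unchanged.
  is_num <- grepl("^[0-9]+$", x)
  out[is_num] <- as.numeric(x[is_num])

  # Intervals such as "30-50" become the midpoint of the two endpoints.
  is_range <- grepl("^[0-9]+-[0-9]+$", x)
  ends <- strsplit(x[is_range], "-", fixed = TRUE)
  out[is_range] <- vapply(ends, function(e) mean(as.numeric(e)), numeric(1))

  # "<6" becomes half of the stated number.
  is_lt <- grepl("^<[0-9]+$", x)
  out[is_lt] <- as.numeric(sub("<", "", x[is_lt], fixed = TRUE)) / 2

  # ">10" and "20+" become the stated number plus one.
  is_gt <- grepl("^>[0-9]+$|^[0-9]+\\+$", x)
  out[is_gt] <- as.numeric(gsub("[>+]", "", x[is_gt])) + 1

  # "null" becomes 0 and "multiple" becomes the supplied stand-in value.
  out[x %in% "null"] <- 0
  out[x %in% "multiple"] <- multiple_value
  out
}

recoded <- recode_shots(c("7", "30-50", "<6", ">10", "20+", "null", "multiple", ""))
recoded
```

Keeping the rules in a function leaves the original .csv untouched and makes each substitution easy to revisit, for example if “null” should be treated as missing rather than as zero.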

Even after these changes, the plot wouldn’t work. After much trial and error, it appears that alluvial plots don’t work if the x-axis value (in this case, the date of the incident) appears in multiple observations. In the interest of time, I had to move on. In the meantime, I decided to explore the shots fired through the variable of whether or not a shooting was preplanned.
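If repeated x-axis values are indeed the obstacle, one possible workaround is to aggregate the data first so that each combination of year and situation appears only once, and then hand the summary table to ggalluvial. This is a sketch under that assumption; I have not verified it on the full dataset. The data frame toy and its situation labels “A” and “B” are made up.

```r
library(dplyr)
library(ggplot2)
library(ggalluvial)

# Made-up incidents; several share the same date.
toy <- data.frame(
  Date        = as.Date(c("1990-01-10", "1990-01-10", "1990-06-02",
                          "1991-03-15", "1991-03-15", "1991-09-30")),
  Situation   = c("A", "B", "A", "A", "B", "B"),
  Shots_Fired = c(2, 5, 1, 4, 3, 6)
)

# Total shots fired per year and situation: one row per combination.
by_year <- toy |>
  mutate(Year = as.integer(format(Date, "%Y"))) |>
  group_by(Year, Situation) |>
  summarise(Shots = sum(Shots_Fired), .groups = "drop")

# Confirm that no Year/Situation pair is repeated.
stopifnot(anyDuplicated(by_year[c("Year", "Situation")]) == 0)

# One alluvium per situation, with its height set by total shots fired.
p <- ggplot(by_year, aes(x = Year, y = Shots, alluvium = Situation)) +
  geom_alluvium(aes(fill = Situation), alpha = 0.75, decreasing = FALSE) +
  labs(x = "Year", y = "Total Shots Fired")
p
```

With the real data, situations that are missing in some years might need to be filled in with zeros (for example with tidyr::complete()) before plotting.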