Our project will analyze trends and contributing factors of school shootings in the United States by modeling the data in a normalized PostgreSQL database. We’ll explore how variables like location, school type, incident characteristics, and demographics relate to shooting frequency and severity.
Communication:
We will use Slack for real-time communication and
Zoom for weekly meetings.
Code Sharing:
We host our code on GitHub at Vaene/Cuny607Project3,
using branches and pull requests for collaboration.
Project Documentation:
Documentation is maintained in this .Rmd file and in the repository's
README.md.
We are using PostgreSQL for structured data storage, managed through DDEV, a containerized local development environment. What follows is a general outline of that setup, not working code.
We will use multiple datasets related to school shootings in the U.S., sourced from:
“School Shootings in the U.S.”
“Gun Violence Archive (GVA)”
NCES (National Center for Education Statistics)
These raw datasets are stored as CSV files in the data/ directory.
We selected a small set of complementary real datasets that together provide incident-level details, school metadata, and supporting contextual series. The choice balances coverage, provenance, and the kinds of joins/analyses we want to run in the RMarkdown.
Rationale summary:
These files together let us compute incident counts per school, compare incident characteristics by school type or enrollment, and run simple spatial joins or fuzzy-name linking when exact keys are missing.
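For example, once the database connection con (created in the Connecting to PostgreSQL in R section) is available, incident counts per school reduce to a single join. This is an illustrative sketch only, assuming the normalized schema described later in this document (shootings.school_id referencing schools.school_id):

# Illustrative sketch: incident counts per school via the normalized schema
incidents_per_school <- DBI::dbGetQuery(con, "
  SELECT sc.name, sc.state, COUNT(*) AS n_incidents
  FROM shootings s
  JOIN schools sc ON sc.school_id = s.school_id
  GROUP BY sc.name, sc.state
  ORDER BY n_incidents DESC
  LIMIT 20;
")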
Before writing to Postgres we perform a few consistent steps to make
the loading robust and reproducible. The loader script
(scripts/load_to_pg.R) implements the following patterns;
the examples below show the rationale and the kinds of transformations
we apply.
Example (conceptual):
# read state files, make all columns character before binding
files <- list.files('data/real-data/sc091a_csv', full.names = TRUE)
dfs <- lapply(files, function(f) {
dat <- vroom::vroom(f, show_col_types = FALSE)
dat[] <- lapply(dat, as.character)
dat
})
schools_raw <- dplyr::bind_rows(dfs)
# now cast specific columns
schools <- schools_raw %>%
mutate(ncessch = as.character(ncessch),
enrollment = as.integer(enrollment))

Beyond this, the loader applies a few consistent conventions:

- Columns are renamed to match the schema (e.g., shooting_id -> incident_id, Injuries -> victims_injured).
- On reload it runs TRUNCATE TABLE <table> CASCADE and then appends the cleaned rows. This prevents accidental DROP TABLE commands that would remove dependent objects (indexes, FKs) and keeps references intact.
- For tables with their own keys (suspect_id, victim_id) we either deduplicate in R before writing, or we filter out rows whose keys already exist in the DB. The loader currently filters out suspected duplicates and logs how many rows were skipped.
- To satisfy foreign keys (victims.incident_id -> shootings.incident_id) we filter victims to only those referencing existing shootings to avoid FK violations. Alternatively, the loader can create minimal stub shooting rows for missing incident IDs.
- Because the Postgres driver executes one statement per call, the loader reads db/schema.sql, splits it into single statements, and executes them sequentially (a minimal sketch appears just after this list).
- We add indexes (state, city, or a PostGIS geometry index) to speed up joins. PostGIS can be added to the DDEV Postgres container if you plan on performing true spatial joins.

The loader's behavior at these decision points is configurable (for example, always upserting suspects instead of skipping duplicates, or creating stub shootings for orphan victims).
We provide three companion R scripts in the scripts/
folder to make data ingestion, analysis, and visualization reproducible
and fast. Include these scripts in your workflow as described below.
### improved_load_to_pg.R (Enhanced Data Loader)

Key improvements over the original loader:

- Creates supporting indexes (a GIN trigram index on lower(name) via pg_trgm for fuzzy joins, plus B-tree indexes on state and city) to speed matching and queries.
- Reads CSVs efficiently with vroom, using column selection, and (where appropriate) uses Postgres COPY semantics to load large tables in bulk.
- Checks whether the database is already up to date and skips the reload when nothing has changed.

Usage:

Rscript scripts/improved_load_to_pg.R

What it does (summary): loads and cleans the CSVs, writes them to Postgres, enables pg_trgm for name similarity, and creates a GIN trigram index on lower(name). A sketch of the index-creation step appears after the freshness example below.

Example freshness-check pseudocode used in the loader:
check_data_freshness <- function(conn, csv_path, table_name, date_col) {
if (!DBI::dbExistsTable(conn, table_name)) return(TRUE)
db_max <- DBI::dbGetQuery(conn, sprintf('SELECT MAX(%s) AS m FROM %s', date_col, table_name))$m
# sample first 5000 rows to get CSV max date cheaply
csv_sample <- vroom::vroom(csv_path, n_max = 5000, col_select = tidyselect::all_of(date_col), show_col_types = FALSE)
csv_max <- max(as.Date(csv_sample[[date_col]]), na.rm = TRUE)
return(is.na(db_max) || csv_max > as.Date(db_max))
}

This saves time on repeated runs by skipping big reads when the data hasn't changed.
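For reference, a hedged sketch of the index-creation step described above; the index names are illustrative (not necessarily those used in the script), con is the loader's DBI connection, and CREATE EXTENSION requires sufficient privileges in the DDEV Postgres container:

# Enable trigram similarity and build the indexes used for fuzzy name matching
DBI::dbExecute(con, "CREATE EXTENSION IF NOT EXISTS pg_trgm;")
# GIN trigram index on lower(name) to support fuzzy school-name joins
DBI::dbExecute(con, "
  CREATE INDEX IF NOT EXISTS idx_schools_name_trgm
  ON schools USING gin (lower(name) gin_trgm_ops);
")
# Plain B-tree indexes to speed state/city joins and filters
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_schools_state ON schools (state);")
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_schools_city ON schools (city);")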
### demographic_analysis.R (Complete Analysis)

Purpose: run a focused demographic analysis on the loaded data and produce a multi-page PDF with four visualizations and console summary statistics.

Analyses included: victim age trends over time, victim gender breakdown by year, the share of suspects with a mental-health flag, and the victim age-group distribution (see the example SQL queries below).

Usage:

Rscript scripts/demographic_analysis.R

Outputs: demographic_analysis.pdf with 4 visualizations.

Notes: the script joins shootings, suspects, and victims (via the normalized linking tables) and filters to the last 10 years by default; change the window in the script arguments if needed (a hypothetical sketch of that argument handling follows).
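As a sketch of that adjustable window, assuming a command-line argument interface (the name years_back and the argument position are illustrative, not necessarily the script's actual interface):

# Default to 10 years, or take the window from the command line:
#   Rscript scripts/demographic_analysis.R 15
args <- commandArgs(trailingOnly = TRUE)
years_back <- if (length(args) >= 1) as.integer(args[1]) else 10
recent_sql <- sprintf("
  SELECT *
  FROM shootings
  WHERE incident_date >= CURRENT_DATE - INTERVAL '%d years'
", years_back)
recent_shootings <- DBI::dbGetQuery(con, recent_sql)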
### create_visualizations.R (Individual Charts)

Creates high-quality PNG charts for presentation or embedding in the report. This is useful if you want separate files for slides rather than the combined PDF.

Usage:

Rscript scripts/create_visualizations.R

Charts produced include standalone PNG versions of the analysis visualizations, one file per chart.

A typical end-to-end workflow combining all three scripts:
# First time - load all data
Rscript scripts/improved_load_to_pg.R
# Run demographic analysis
Rscript scripts/demographic_analysis.R
# Later - check will skip reload if data unchanged
Rscript scripts/improved_load_to_pg.R # outputs: "Data is current. Skipping reload." if unchanged
# Generate individual PNG charts
Rscript scripts/create_visualizations.R

Below are a few example SQL queries the analysis uses (implemented in
demographic_analysis.R) to compute metrics. They assume
standard joins through the linking table(s):
SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
AVG(v.age) AS avg_age,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY v.age) AS median_age
FROM shootings s
JOIN victims v ON v.incident_id = s.incident_id
WHERE v.age IS NOT NULL
GROUP BY year
ORDER BY year;

SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
v.gender,
COUNT(*)::int AS count
FROM shootings s
JOIN victims v ON v.incident_id = s.incident_id
GROUP BY year, v.gender
ORDER BY year, count DESC;

SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
100.0 * SUM(CASE WHEN suspect.mental_health_flag = TRUE THEN 1 ELSE 0 END)::float / COUNT(*) AS pct_with_mh
FROM shootings s
JOIN suspects suspect ON suspect.suspect_id = (SELECT ss.suspect_id FROM shooting_suspects ss WHERE ss.incident_id = s.incident_id LIMIT 1)
GROUP BY year
ORDER BY year;

SELECT age_group, COUNT(*) FROM (
SELECT CASE
WHEN age < 13 THEN 'Under 13'
WHEN age BETWEEN 13 AND 17 THEN '13-17'
WHEN age BETWEEN 18 AND 24 THEN '18-24'
WHEN age BETWEEN 25 AND 34 THEN '25-34'
WHEN age BETWEEN 35 AND 54 THEN '35-54'
ELSE '55+' END AS age_group
FROM victims
WHERE age IS NOT NULL
) t
GROUP BY age_group
ORDER BY COUNT(*) DESC;
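These queries can be run from R with DBI::dbGetQuery() and the results passed straight to ggplot2. A minimal sketch for the age-trend query above, with plot styling simplified relative to the report's figures:

library(DBI)
library(ggplot2)

age_trend <- dbGetQuery(con, "
  SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
         AVG(v.age) AS avg_age
  FROM shootings s
  JOIN victims v ON v.incident_id = s.incident_id
  WHERE v.age IS NOT NULL
  GROUP BY year
  ORDER BY year;
")

ggplot(age_trend, aes(x = year, y = avg_age)) +
  geom_line(color = "darkred", linewidth = 1) +
  geom_point(color = "darkred") +
  labs(title = "Average Victim Age by Year", x = "Year", y = "Average age")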
## Connecting to PostgreSQL in R

We use the DBI, RPostgres, dplyr, tidyr, ggplot2, scales, and knitr libraries for database operations, data manipulation, and reporting:
# Recommended: read DB connection details from environment variables when possible.
# This keeps credentials out of source files and lets you switch between host vs in-container easily.
#
# Example: set these env vars on your machine (host) or in the container:
# DB_HOST (127.0.0.1 when connecting from host; 'db' when connecting from inside a ddev container)
# DB_PORT (forwarded port on host, e.g. 53466; 5432 inside container)
# DB_USER (usually 'db' for ddev)
# DB_PASS (usually 'db' for ddev)
# DB_NAME (usually 'db')
Sys.setenv(
DB_HOST = Sys.getenv("DB_HOST", "127.0.0.1"),
DB_PORT = Sys.getenv("DB_PORT", "53466"),
DB_USER = Sys.getenv("DB_USER", "db"),
DB_PASS = Sys.getenv("DB_PASS", "db"),
DB_NAME = Sys.getenv("DB_NAME", "db")
)
con <- dbConnect(
RPostgres::Postgres(),
host = Sys.getenv("DB_HOST"), # 127.0.0.1 (host) or 'db' (in-container)
port = as.integer(Sys.getenv("DB_PORT")),
user = Sys.getenv("DB_USER"),
password = Sys.getenv("DB_PASS"),
dbname = Sys.getenv("DB_NAME")
)

You can test the connection with dbGetQuery(con, "SELECT version();"); a fuller helper check appears below.
If you prefer explicit examples, here are the two common cases (keep eval=FALSE in documentation):
# When running R on your macOS host (outside ddev)
con <- dbConnect(
RPostgres::Postgres(),
host = "127.0.0.1", # use forwarded host port from `ddev describe`
port = 53466, # replace with your forwarded port
user = "db",
password = "db",
dbname = "db"
)

# When running R inside the ddev project (e.g. via `ddev ssh` or an in-container RStudio)
con <- dbConnect(
RPostgres::Postgres(),
host = "db", # internal ddev hostname
port = 5432,
user = "db",
password = "db",
dbname = "db"
)

# Helpful check (run after setting the environment variables above)
tryCatch({
con <- dbConnect(
RPostgres::Postgres(),
host = Sys.getenv("DB_HOST"),
port = as.integer(Sys.getenv("DB_PORT")),
user = Sys.getenv("DB_USER"),
password = Sys.getenv("DB_PASS"),
dbname = Sys.getenv("DB_NAME")
)
print(dbGetQuery(con, "SELECT version();"))
dbDisconnect(con)
}, error = function(e) {
message("Connection failed: ", e$message)
message("Checklist: is ddev running? Did you run `ddev describe` and use the forwarded port? Are credentials correct?")
})

library(ggplot2)
library(tidyr)
# Suppress scientific notation for better readability
options(scipen = 999)
con <- dbConnect(
RPostgres::Postgres(),
host = "127.0.0.1",
port = 53466,
user = "db",
password = "db",
dbname = "db"
)
# Verify connection
dbGetQuery(con, "SELECT version();")

## version
## 1 PostgreSQL 15.14 (Debian 15.14-1.pgdg13+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 14.2.0-19) 14.2.0, 64-bit
# Create a lazy dplyr table reference to the shootings table
shootings_tbl <- tbl(con, "shootings")

# Query it with dplyr verbs; collect() pulls the result into R
shootings_summary <- shootings_tbl %>%
group_by(state) %>%
summarise(total_incidents = n()) %>%
arrange(desc(total_incidents)) %>%
collect()
shootings_summary

## # A tibble: 52 × 2
## state total_incidents
## <chr> <int64>
## 1 Illinois 34819
## 2 Texas 27079
## 3 California 27007
## 4 Pennsylvania 20324
## 5 Florida 20212
## 6 Ohio 17951
## 7 New York 15864
## 8 North Carolina 14401
## 9 Georgia 14215
## 10 Louisiana 14177
## # ℹ 42 more rows
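Because shootings_tbl is a lazy dbplyr table, the summary above is translated to SQL and executed inside Postgres; only the collected result comes back into R. You can inspect the generated SQL with show_query():

shootings_tbl %>%
  group_by(state) %>%
  summarise(total_incidents = n()) %>%
  show_query()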
## === DATABASE SUMMARY ===
## Total shooting incidents: 390,837
## Total schools: 122,398
## Suspect records: 2 (insufficient for demographic analysis)
## Date range: 1840-11-12 to 2023-12-31
| State | Incidents | Killed | Injured | Avg Casualties |
|---|---|---|---|---|
| Illinois | 34,819 | 8,636 | 33,802 | 1.22 |
| Texas | 27,079 | 12,249 | 17,945 | 1.12 |
| California | 27,007 | 12,124 | 19,084 | 1.16 |
| Pennsylvania | 20,324 | 6,223 | 17,624 | 1.17 |
| Florida | 20,212 | 8,299 | 15,176 | 1.16 |
| Ohio | 17,951 | 6,134 | 14,604 | 1.16 |
| New York | 15,864 | 4,171 | 14,520 | 1.18 |
| North Carolina | 14,401 | 5,548 | 11,178 | 1.16 |
| Georgia | 14,215 | 6,189 | 10,041 | 1.14 |
| Louisiana | 14,177 | 5,548 | 11,303 | 1.19 |
| Tennessee | 13,127 | 4,672 | 10,419 | 1.15 |
| Michigan | 11,759 | 4,147 | 9,497 | 1.16 |
| Maryland | 11,699 | 4,338 | 9,434 | 1.18 |
| Missouri | 11,215 | 5,108 | 8,094 | 1.18 |
| Virginia | 10,874 | 3,718 | 8,846 | 1.16 |
| Indiana | 10,073 | 3,900 | 7,708 | 1.15 |
| Alabama | 9,976 | 4,487 | 6,960 | 1.15 |
| South Carolina | 9,671 | 3,931 | 7,285 | 1.16 |
| Wisconsin | 8,326 | 2,182 | 7,032 | 1.11 |
| New Jersey | 6,832 | 2,316 | 6,039 | 1.22 |
### Temporal Trends (Past Decade)
# Annual statistics
annual_data <- dbGetQuery(con, "
SELECT
EXTRACT(YEAR FROM incident_date)::INTEGER as year,
COUNT(*)::INTEGER as incidents,
SUM(victims_killed)::INTEGER as killed,
SUM(victims_injured)::INTEGER as injured,
(SUM(victims_killed) + SUM(victims_injured))::INTEGER as total_casualties
FROM shootings
WHERE incident_date >= CURRENT_DATE - INTERVAL '10 years'
AND incident_date IS NOT NULL
GROUP BY year
ORDER BY year
")
kable(annual_data,
caption = "Annual Shooting Statistics (Past Decade)",
format.args = list(big.mark = ","))

| year | incidents | killed | injured | total_casualties |
|---|---|---|---|---|
| 2015 | 6,532 | 2,352 | 4,941 | 7,293 |
| 2016 | 36,685 | 12,869 | 28,652 | 41,521 |
| 2017 | 37,881 | 13,424 | 29,488 | 42,912 |
| 2018 | 34,602 | 12,435 | 26,155 | 38,590 |
| 2019 | 36,347 | 13,014 | 28,388 | 41,402 |
| 2020 | 46,697 | 17,098 | 37,522 | 54,620 |
| 2021 | 48,481 | 18,474 | 38,692 | 57,166 |
| 2022 | 46,517 | 17,756 | 36,703 | 54,459 |
| 2023 | 43,586 | 16,274 | 34,581 | 50,855 |
# Line chart - incidents
p1 <- ggplot(annual_data, aes(x = year, y = incidents)) +
geom_line(color = 'darkred', linewidth = 1.5) +
geom_point(size = 4, color = 'darkred') +
geom_smooth(method = 'lm', se = TRUE, alpha = 0.2) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Annual Shooting Incidents Over Time",
subtitle = "Past decade with linear trend",
x = "Year", y = "Number of Incidents") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = 'bold'))

print(p1)

## `geom_smooth()` using formula = 'y ~ x'
# Statistical test
if (nrow(annual_data) >= 3) {
trend_test <- cor.test(annual_data$year, annual_data$incidents)
cat("\n**Trend Analysis:**\n")
cat(sprintf("Correlation: r = %.3f, p = %.4f\n",
trend_test$estimate, trend_test$p.value))
if (trend_test$p.value < 0.05) {
cat(sprintf("✓ Statistically significant %s in incidents over time\n\n",
ifelse(trend_test$estimate > 0, "INCREASE", "DECREASE")))
}
}

##
## **Trend Analysis:**
## Correlation: r = 0.758, p = 0.0181
## ✓ Statistically significant INCREASE in incidents over time
# Casualties stacked area chart
casualties_long <- annual_data %>%
select(year, killed, injured) %>%
pivot_longer(cols = c(killed, injured),
names_to = "casualty_type",
values_to = "count")
p2 <- ggplot(casualties_long, aes(x = year, y = count, fill = casualty_type)) +
geom_area(alpha = 0.7) +
scale_fill_manual(values = c("killed" = "darkred", "injured" = "orange"),
labels = c("Deaths", "Injuries")) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Annual Casualties from Shooting Incidents",
x = "Year", y = "Number of People", fill = "Type") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = 'bold'))
print(p2)
### Monthly Patterns
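The monthly aggregation follows the same pattern as the annual query; a hedged sketch, not the exact chunk from the report:

# Incidents by calendar month over the past decade
monthly_data <- dbGetQuery(con, "
  SELECT EXTRACT(MONTH FROM incident_date)::INTEGER AS month,
         COUNT(*)::INTEGER AS incidents
  FROM shootings
  WHERE incident_date >= CURRENT_DATE - INTERVAL '10 years'
    AND incident_date IS NOT NULL
  GROUP BY month
  ORDER BY month
")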
### Severity Distribution
### Deadliest Incidents (≥10 casualties)
| Date | State | Killed | Injured | Total |
|---|---|---|---|---|
| 2017-10-01 | Nevada | 60 | 439 | 499 |
| 2016-06-12 | Florida | 49 | 53 | 102 |
| 2019-08-03 | Texas | 23 | 23 | 46 |
| 2017-11-05 | Texas | 26 | 20 | 46 |
| 2022-05-24 | NA | 22 | 18 | 40 |
| 2022-05-24 | Texas | 21 | 17 | 38 |
| 2022-07-04 | Illinois | 7 | 30 | 37 |
| 2023-04-15 | Alabama | 4 | 32 | 36 |
| 2018-02-14 | Florida | 17 | 17 | 34 |
| 2018-02-14 | NA | 17 | 17 | 34 |
| 2015-12-02 | California | 14 | 19 | 33 |
| 2023-10-25 | Maine | 18 | 13 | 31 |
| 2019-08-31 | Texas | 7 | 23 | 30 |
| 2023-07-02 | Maryland | 2 | 27 | 29 |
| 2022-03-19 | Arkansas | 1 | 26 | 27 |
### Summary Statistics

**Total casualties (past decade):** 388,796

- Deaths: 123,696
- Injuries: 265,122

**Average casualties per incident:** 1.15

**Most affected state:** Illinois (34,819 incidents)
Our schema follows a normalized design (3NF) to minimize redundancy. Core tables:

- Schools (school_id, name, city, state, school_type, enrollment)
- Shootings (shooting_id, date, school_id, num_killed, num_injured, weapon_type, motive)
- Suspects (suspect_id, name, age, gender, mental_health_flag)
- Victims (victim_id, shooting_id, age, role, injury_status)
- Weapons (weapon_id, type, legally_owned, owner_id)
## Normalization Notes

The schema is in Third Normal Form (3NF):

- Each table has a primary key.
- Non-key attributes depend only on the primary key.
- Many-to-many relationships are handled via linking tables (shooting_suspects, suspect_weapons), as illustrated below.
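A hedged query sketch showing how the many-to-many relationship is resolved through the shooting_suspects linking table; column names follow the entity list above, so substitute incident_id if your loaded schema uses that name instead:

# Join shootings to suspects through the linking table
suspect_demo <- DBI::dbGetQuery(con, "
  SELECT s.shooting_id, s.date, sp.age, sp.gender
  FROM shootings s
  JOIN shooting_suspects ss ON ss.shooting_id = s.shooting_id
  JOIN suspects sp ON sp.suspect_id = ss.suspect_id
  LIMIT 10;
")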
We would have liked to delve further into the analysis of suspects, but we were unable to find data that supported a meaningful analysis. Even so, we gained some key insights into the demographics of school shootings themselves that can help us better understand the problem.
This exercise also asked a broader question: which data science skills are most valued?
To be an effective data analyst, you need the right mix of technical ability, problem-solving skills, and communication. SQL is one of the most essential skills because it is the industry standard for accessing and managing data, and Excel is still widely used for quick analysis and smaller datasets. According to an article by Northeastern University (O'Connor, 2024), SQL and Excel are considered two of the top foundational tools every analyst should know. Programming skills in Python or R help with automation, large-scale data analysis, and statistical work. Data visualization tools like Tableau or Power BI are also important for telling clear and compelling stories with data. As the Northeastern article points out, communication and presentation are just as critical as technical analysis, since analysts need to make insights understandable for non-technical audiences.

On the soft-skills side, critical thinking, attention to detail, and problem-solving help analysts interpret data correctly and avoid errors. Collaboration and domain knowledge, such as understanding the business context, make it easier to translate data into real decisions. Altogether, the strongest analysts combine technical tools, analytical thinking, and strong communication to deliver insights that matter.

On the personal side, I would also say that perseverance, adaptability, and integrity are important as well. It is often hard to find the exact data you are looking for in a usable form. Getting different datasets to play nicely with each other can take a lot of trial and error, so you need a strong sense of perseverance to get through when the going gets tough. Sometimes your first choices are simply wrong, and being adaptable enough to change course can mean the difference between failure and success. Lastly, a sense of integrity keeps you on the right path. None of us is free from bias; in fact, that bias often drives us toward the subjects we choose to analyze. Being honest during the actual analysis and putting in the necessary safeguards to compensate for your bias is critical to producing results that can be depended on.

Reference: O'Connor, S. W. (2024, July 9). 7 must-have skills for data analysts. Northeastern University Graduate Programs.