Group Collaboration and Database Design Document

Group Members

  • Sabina Baraili
  • Escarlet Gabriel
  • Randy Howk

Working Thesis

Our project will analyze trends and contributing factors of school shootings in the United States by modeling the data in a normalized PostgreSQL database. We’ll explore how variables like location, school type, incident characteristics, and demographics relate to shooting frequency and severity.

Collaboration Tools

Communication:
We will use Slack for real-time communication and Zoom for weekly meetings.

Code Sharing:
We host our code on GitHub in the Vaene/Cuny607Project3 repository, using branches and pull requests for collaboration.

Project Documentation:
Documentation is maintained in the project .Rmd file and in the repository’s README.md.


PostgreSQL Integration (via DDEV)

We are using PostgreSQL for structured data storage, managed through DDEV, a containerized local development environment. What follows is a general outline of that setup, not working code.

DDEV Setup

ddev config --project-type=php --docroot=web --create-docroot --database=postgres:15
ddev start
ddev describe

Default credentials:

  • Host: db
  • Port: 5432
  • User: db
  • Password: db
  • Database: db

To connect manually:

ddev psql

Data Sources

We will use multiple datasets related to school shootings in the U.S., sourced from:

  • Kaggle: “School Shootings in the U.S.” and “Gun Violence Archive (GVA)”
  • FBI Uniform Crime Reporting (UCR)
  • NCES (National Center for Education Statistics)

These raw datasets are stored as CSV files in the data/ directory.

Why we chose these real datasets

We selected a small set of complementary real datasets that together provide incident-level details, school metadata, and supporting contextual series. The choice balances coverage, provenance, and the kinds of joins/analyses we want to run in the RMarkdown.

  • Gun Violence Archive (GVA): a near-real-time, incident-level dataset containing dates, locations (lat/lon when available), and counts of victims injured/killed. GVA is our primary source for large-scale incident counts and coordinates, which are useful for spatial joins.
  • NCES school-level files (sc091a / ag091a): authoritative school registry records (the National Center for Education Statistics) that provide school identifiers (ncessch), school names, addresses, types, and enrollment. NCES is the canonical source for mapping incidents to schools.
  • USA_school_shootings.csv (historical curated dataset): includes cases and narrative context that supplement GVA’s tabular fields for older or notable incidents.
  • weapons.json (supplementary time series / rates): provides weapon-related time series or rates which we store in a separate table for trend analysis.

Rationale summary:

  • Use GVA for scale and geocoordinates (good for spatial linking and statewide trends).
  • Use NCES so that incidents can be mapped to stable school identifiers (ncessch) rather than relying solely on free-text school names.
  • Use the historical CSV to retain additional cases and narrative that may not be fully represented in GVA.
  • Keep weapons/time-series data separate so it can be joined by date/region rather than row-wise to incidents.

These files together let us compute incident counts per school, compare incident characteristics by school type or enrollment, and run simple spatial joins or fuzzy-name linking when exact keys are missing.
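
For example, once both sources are loaded, incidents per school can be approximated by joining on state plus trigram name similarity. The query below is a conceptual sketch only: it assumes a live connection con (established later in this document), the pg_trgm extension, and a free-text school name column on the incident side (called school_name here as a placeholder).

# Sketch: approximate incidents per NCES school via state + fuzzy name match
# Assumes pg_trgm is installed; s.school_name is a placeholder column name,
# and sc.ncessch could instead be school_id depending on which identifier the schools table keeps
incidents_per_school <- DBI::dbGetQuery(con, "
  SELECT sc.ncessch, sc.name, COUNT(*) AS incidents
  FROM schools sc
  JOIN shootings s
    ON s.state = sc.state
   AND similarity(lower(s.school_name), lower(sc.name)) > 0.6
  GROUP BY sc.ncessch, sc.name
  ORDER BY incidents DESC;
")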

How we clean and coerce the data for PostgreSQL

Before writing to Postgres we perform a few consistent steps to make the loading robust and reproducible. The loader script (scripts/load_to_pg.R) implements the following patterns; the examples below show the rationale and the kinds of transformations we apply.

  1. Coerce heterogeneous CSV column types to character when binding many files.
  • Problem: NCES publishes one CSV per state. Some columns that look numeric in one state’s file are encoded differently in another state’s file (empty strings, leading zeros, or different column types). bind_rows across these files will fail if column types conflict.
  • Solution: read each file, convert all columns to character (or a safe common type) and only then bind. After binding we explicitly cast columns we know should be numeric or date.

Example (conceptual):

# read state files, make all columns character before binding
files <- list.files('data/real-data/sc091a_csv', full.names = TRUE)
dfs <- lapply(files, function(f) {
  dat <- vroom::vroom(f, show_col_types = FALSE)
  dat[] <- lapply(dat, as.character)
  dat
})
schools_raw <- dplyr::bind_rows(dfs)
# now cast specific columns
schools <- schools_raw |>
  dplyr::mutate(ncessch = as.character(ncessch),
                enrollment = as.integer(enrollment))
  2. Normalize column names and keys
  • Lowercase column names and rename similar columns to canonical names (e.g., shooting_id -> incident_id, Injuries -> victims_injured).
  3. Parse and coerce dates and numerics carefully
  • Use explicit parsing: as.Date(date, format = "%Y-%m-%d") or lubridate helpers when needed.
  • When coercing with as.integer/as.numeric we document that NAs may be introduced; those rows are either kept (with NULLs in Postgres) or filtered, depending on the table and its FK constraints.
  4. Apply safe write semantics to Postgres
  • The loader uses a conservative refresh pattern: when we want to replace a table we run TRUNCATE TABLE <table> CASCADE and then append the cleaned rows (see the sketch after this list). This avoids accidental DROP TABLE commands that would remove dependent objects (indexes, FKs) and keeps references intact.
  5. Avoid duplicate primary keys and enforce FK integrity
  • For tables with client-supplied primary keys (e.g., suspect_id, victim_id) we either deduplicate in R before writing or filter out rows whose keys already exist in the DB. The loader currently filters out suspected duplicates and logs how many rows were skipped.
  • For FK constraints (e.g., victims.incident_id -> shootings.incident_id) we filter victims to only those referencing existing shootings, to avoid FK violations. Alternatively, the loader could create minimal stub shooting rows for missing incident IDs.
  6. Split multi-statement SQL for schema application
  • Some DB drivers reject multi-statement strings when preparing queries. The loader reads db/schema.sql, splits it into single statements, and executes them sequentially (see the sketch below).
  7. Indexing and geospatial readiness
  • When loading coordinates we also create indexes (for example, a B-tree on state and city, or a PostGIS geometry index) to speed up joins. PostGIS can be added to the DDEV Postgres container if true spatial joins are needed.
  8. Logging and idempotency
  • The loader logs row counts for each step and is idempotent: repeated runs replace the main tables and append only new records where appropriate.
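
Example (conceptual) of the write pattern in steps 4 and 5, assuming a live connection con (see the connection section below) and cleaned data frames shootings_clean and victims_clean produced by the earlier steps:

# Refresh the shootings table without dropping dependent objects
DBI::dbExecute(con, "TRUNCATE TABLE shootings CASCADE;")
DBI::dbAppendTable(con, "shootings", shootings_clean)

# Keep only victims whose incident_id exists in shootings, then append
existing_ids <- DBI::dbGetQuery(con, "SELECT incident_id FROM shootings;")$incident_id
victims_ok   <- victims_clean[victims_clean$incident_id %in% existing_ids, ]
message(nrow(victims_clean) - nrow(victims_ok), " victim rows skipped (no matching incident)")
DBI::dbAppendTable(con, "victims", victims_ok)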

The loader’s behavior can be adjusted if a different policy is preferred (for example, upserting suspects instead of skipping duplicates, or creating stub shootings for orphan victims).
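
For step 6, the statement splitting is roughly this simple (a conceptual sketch that assumes db/schema.sql contains ';'-terminated statements with no semicolons inside string literals):

# Apply db/schema.sql one statement at a time
schema_sql <- paste(readLines("db/schema.sql"), collapse = "\n")
statements <- trimws(strsplit(schema_sql, ";", fixed = TRUE)[[1]])
for (stmt in statements[nzchar(statements)]) {
  DBI::dbExecute(con, stmt)
}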

Automated loader and analysis scripts

We provide three companion R scripts in the scripts/ folder to make data ingestion, analysis, and visualization reproducible and fast. Include these scripts in your workflow as described below.

1) improved_load_to_pg.R (Enhanced Data Loader)

Key improvements over the original loader:

  • Data Freshness Check: compares the max date in the DB vs the CSV and skips a full reload unless the CSV contains newer data.
  • Skip Unnecessary Loads: avoids re-reading large files when the database is already current.
  • Performance Indexes: creates indexes on commonly queried columns (for example, a GIN trigram index on lower(name) via pg_trgm for fuzzy name joins, and B-tree indexes on state and city) to speed matching and queries.
  • Faster Reading and Loading: uses vroom efficiently with column selection and (where appropriate) uses Postgres COPY semantics to load large tables in bulk.

Usage:

Rscript scripts/improved_load_to_pg.R

What it does (summary):

  • Checks freshness by sampling the CSV for its max date (fast, since only the first N rows are read) and comparing it with the DB max date.
  • If the CSV is newer, reads the CSV (chunked when possible), cleans and coerces columns, and bulk-loads to Postgres.
  • Builds or verifies performance indexes (optionally enables pg_trgm for name similarity and creates a GIN trigram index on lower(name)); a sketch of this step appears at the end of this subsection.

Example freshness-check pseudocode used in the loader:

check_data_freshness <- function(conn, csv_path, table_name, date_col) {
  if (!DBI::dbExistsTable(conn, table_name)) return(TRUE)
  db_max <- DBI::dbGetQuery(conn, sprintf('SELECT MAX(%s) AS m FROM %s', date_col, table_name))$m
  # sample the first 5000 rows to get the CSV max date cheaply
  # (a heuristic: assumes recent dates appear within the sampled rows)
  csv_sample <- vroom::vroom(csv_path, n_max = 5000, col_select = dplyr::all_of(date_col))
  csv_max <- max(as.Date(csv_sample[[date_col]]), na.rm = TRUE)
  # reload when the table is empty or the CSV has newer rows
  return(is.na(db_max) || csv_max > as.Date(db_max))
}
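
A typical call looks like this (the path, table, and column names are placeholders for whichever table is being refreshed):

# Returns TRUE when the CSV appears to contain newer rows than the shootings table
check_data_freshness(con, 'data/real-data/gva.csv', 'shootings', 'incident_date')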

This saves time on repeated runs by skipping big reads when data hasn’t changed.
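
Conceptually, the index-building step issues statements like the following (a sketch; the index names are illustrative and CREATE EXTENSION requires appropriate privileges):

# Sketch: performance indexes described above (index names are illustrative)
DBI::dbExecute(con, "CREATE EXTENSION IF NOT EXISTS pg_trgm;")
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_schools_name_trgm
                     ON schools USING gin (lower(name) gin_trgm_ops);")
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_shootings_state_city
                     ON shootings (state, city);")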

2) demographic_analysis.R (Complete Analysis)

Purpose: run a focused demographic analysis on the loaded data and produce a multi-page PDF with four visualizations and console summary statistics.

Analyses included:

  • Age Trends: average, median, min/max age over time (by year)
  • Gender Distribution: percent breakdown by year
  • Mental Health Flags: proportion with documented mental-health indicators over time
  • Age Groups: distribution across six age brackets (Under 13, 13–17, 18–24, 25–34, 35–54, 55+)

Usage:

Rscript scripts/demographic_analysis.R

Outputs:

  • Console summary statistics
  • demographic_analysis.pdf with 4 visualizations

Notes: the script joins shootings, suspects, and victims (via the normalized linking tables) and filters to the last 10 years by default; change the window in the script arguments if needed.
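
The age brackets are applied with a simple cut() in R, roughly like this (a sketch; the script’s actual data frame and column names may differ):

# Sketch: bucket ages into the six brackets listed above
ages <- demo_df$age   # demo_df is a placeholder for the joined analysis data frame
demo_df$age_group <- cut(ages,
                         breaks = c(-Inf, 12, 17, 24, 34, 54, Inf),
                         labels = c("Under 13", "13-17", "18-24", "25-34", "35-54", "55+"))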

3) create_visualizations.R (Individual Charts)

Creates high-quality PNG charts for presentation or embedding in the report. This is useful if you want separate files for slides rather than the combined PDF.

Usage:

Rscript scripts/create_visualizations.R

Charts produced include:

  • Average age trend (line + regression)
  • Gender distribution by year (stacked bar)
  • Age group distribution (stacked area)
  • Mental-health trend (line chart)
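
Each chart is a ggplot object written to disk with ggsave; the average-age trend, for instance, is produced roughly like this (a sketch; age_by_year and its columns are placeholders for the script’s actual data):

# Sketch: average-age trend chart saved as a high-resolution PNG
library(ggplot2)
p_age <- ggplot(age_by_year, aes(x = year, y = avg_age)) +
  geom_line(color = "darkred", linewidth = 1) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Average Age Over Time", x = "Year", y = "Average age")
ggsave("age_trend.png", p_age, width = 8, height = 5, dpi = 300)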

Example workflow

# First time - load all data
Rscript scripts/improved_load_to_pg.R

# Run demographic analysis
Rscript scripts/demographic_analysis.R

# Later - check will skip reload if data unchanged
Rscript scripts/improved_load_to_pg.R  # outputs: "Data is current. Skipping reload." if unchanged

# Generate individual PNG charts
Rscript scripts/create_visualizations.R

Sample queries used by analysis

Below are a few example SQL queries the analysis uses (implemented in demographic_analysis.R) to compute metrics. They assume standard joins through the linking table(s):

  1. Average age by year
SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
       AVG(v.age) AS avg_age,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY v.age) AS median_age
FROM shootings s
JOIN victims v ON v.incident_id = s.incident_id
WHERE v.age IS NOT NULL
GROUP BY year
ORDER BY year;
  2. Gender distribution by year
SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
       v.gender,
       COUNT(*)::int AS count
FROM shootings s
JOIN victims v ON v.incident_id = s.incident_id
GROUP BY year, v.gender
ORDER BY year, count DESC;
  3. Mental-health flags over time (percentage)
SELECT EXTRACT(YEAR FROM s.incident_date)::int AS year,
       100.0 * SUM(CASE WHEN suspect.mental_health_flag = TRUE THEN 1 ELSE 0 END)::float / COUNT(*) AS pct_with_mh
FROM shootings s
JOIN suspects suspect ON suspect.suspect_id = (SELECT ss.suspect_id FROM shooting_suspects ss WHERE ss.incident_id = s.incident_id LIMIT 1)
GROUP BY year
ORDER BY year;
  4. Age-group distribution (SQL-friendly age buckets)
SELECT age_group, COUNT(*) FROM (
  SELECT CASE
    WHEN age < 13 THEN 'Under 13'
    WHEN age BETWEEN 13 AND 17 THEN '13-17'
    WHEN age BETWEEN 18 AND 24 THEN '18-24'
    WHEN age BETWEEN 25 AND 34 THEN '25-34'
    WHEN age BETWEEN 35 AND 54 THEN '35-54'
    ELSE '55+' END AS age_group
  FROM victims
  WHERE age IS NOT NULL
) t
GROUP BY age_group
ORDER BY COUNT(*) DESC;


Connecting to PostgreSQL in R

We use the following R libraries for database operations and data manipulation:

library(DBI)
library(RPostgres)
library(dplyr)
library(readr)
library(dbplyr)

Establish Connection

# Recommended: read DB connection details from environment variables when possible.
# This keeps credentials out of source files and lets you switch between host vs in-container easily.
#
# Example: set these env vars on your machine (host) or in the container:
# DB_HOST (127.0.0.1 when connecting from host; 'db' when connecting from inside a ddev container)
# DB_PORT (forwarded port on host, e.g. 53466; 5432 inside container)
# DB_USER (usually 'db' for ddev)
# DB_PASS (usually 'db' for ddev)
# DB_NAME (usually 'db')

Sys.setenv(
  DB_HOST = Sys.getenv("DB_HOST", "127.0.0.1"),
  DB_PORT = Sys.getenv("DB_PORT", "53466"),
  DB_USER = Sys.getenv("DB_USER", "db"),
  DB_PASS = Sys.getenv("DB_PASS", "db"),
  DB_NAME = Sys.getenv("DB_NAME", "db")
)

con <- dbConnect(
  RPostgres::Postgres(),
  host = Sys.getenv("DB_HOST"),      # 127.0.0.1 (host) or 'db' (in-container)
  port = as.integer(Sys.getenv("DB_PORT")),
  user = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASS"),
  dbname = Sys.getenv("DB_NAME")
)

You can test the connection with:

dbGetQuery(con, "SELECT version();")

If you prefer explicit examples, here are the two common cases (keep eval=FALSE in documentation):

# When running R on your macOS host (outside ddev)
con <- dbConnect(
  RPostgres::Postgres(),
  host = "127.0.0.1",   # connect to localhost when running R on the host machine
  port = 53466,         # replace with the forwarded port from `ddev describe`
  user = "db",
  password = "db",
  dbname = "db"
)
# When running R inside the ddev project (e.g. via `ddev ssh` or an in-container RStudio)
con <- dbConnect(
  RPostgres::Postgres(),
  host = "db",          # internal ddev hostname
  port = 5432,
  user = "db",
  password = "db",
  dbname = "db"
)
# Helpful check (run after setting the environment variables above)
tryCatch({
  con <- dbConnect(
    RPostgres::Postgres(),
    host = Sys.getenv("DB_HOST"),
    port = as.integer(Sys.getenv("DB_PORT")),
    user = Sys.getenv("DB_USER"),
    password = Sys.getenv("DB_PASS"),
    dbname = Sys.getenv("DB_NAME")
  )
  print(dbGetQuery(con, "SELECT version();"))
  dbDisconnect(con)
}, error = function(e) {
  message("Connection failed: ", e$message)
  message("Checklist: is ddev running? Did you run `ddev describe` and use the forwarded port? Are credentials correct?")
})

Loading and Cleaning Data in R

Load Data Using Automated Script

# Use the improved loader script instead of manual chunks
source('scripts/improved_load_to_pg.R')

Querying the Database in R

Connect to Database

library(DBI)
library(RPostgres)
library(dplyr)
library(dbplyr)
library(knitr)
library(scales)
library(ggplot2)
library(tidyr)
# Suppress scientific notation for better readability
options(scipen = 999)

con <- dbConnect(
  RPostgres::Postgres(),
  host = "127.0.0.1",
  port = 53466,
  user = "db",
  password = "db",
  dbname = "db"
)

# Verify connection
dbGetQuery(con, "SELECT version();")
##                                                                                                                      version
## 1 PostgreSQL 15.14 (Debian 15.14-1.pgdg13+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 14.2.0-19) 14.2.0, 64-bit
# Create a lazy reference to the shootings table
shootings_tbl <- tbl(con, "shootings")

# Query it with dplyr verbs and collect the results
shootings_summary <- shootings_tbl %>%
  group_by(state) %>%
  summarise(total_incidents = n()) %>%
  arrange(desc(total_incidents)) %>%
  collect()

shootings_summary
## # A tibble: 52 × 2
##    state          total_incidents
##    <chr>                  <int64>
##  1 Illinois                 34819
##  2 Texas                    27079
##  3 California               27007
##  4 Pennsylvania             20324
##  5 Florida                  20212
##  6 Ohio                     17951
##  7 New York                 15864
##  8 North Carolina           14401
##  9 Georgia                  14215
## 10 Louisiana                14177
## # ℹ 42 more rows

Comprehensive Shooting Incident Analysis

Data Overview

## === DATABASE SUMMARY ===
## Total shooting incidents: 390,837
## Total schools: 122,398
## Suspect records: 2 (insufficient for demographic analysis)
## Date range: 1840-11-12 to 2023-12-31 (about 183 years)

Geographic Analysis

Top 20 States by Shooting Incidents
State Incidents Killed Injured Avg Casualties
Illinois 34,819 8,636 33,802 1.22
Texas 27,079 12,249 17,945 1.12
California 27,007 12,124 19,084 1.16
Pennsylvania 20,324 6,223 17,624 1.17
Florida 20,212 8,299 15,176 1.16
Ohio 17,951 6,134 14,604 1.16
New York 15,864 4,171 14,520 1.18
North Carolina 14,401 5,548 11,178 1.16
Georgia 14,215 6,189 10,041 1.14
Louisiana 14,177 5,548 11,303 1.19
Tennessee 13,127 4,672 10,419 1.15
Michigan 11,759 4,147 9,497 1.16
Maryland 11,699 4,338 9,434 1.18
Missouri 11,215 5,108 8,094 1.18
Virginia 10,874 3,718 8,846 1.16
Indiana 10,073 3,900 7,708 1.15
Alabama 9,976 4,487 6,960 1.15
South Carolina 9,671 3,931 7,285 1.16
Wisconsin 8,326 2,182 7,032 1.11
New Jersey 6,832 2,316 6,039 1.22
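
The chunk that builds this table is not echoed above; a query of roughly this shape produces it (a sketch using the same shootings columns as the annual query below):

# Sketch: state-level incident and casualty summary behind the table above
state_summary <- dbGetQuery(con, "
  SELECT state,
         COUNT(*)::int AS incidents,
         SUM(victims_killed)::int AS killed,
         SUM(victims_injured)::int AS injured,
         ROUND((SUM(victims_killed) + SUM(victims_injured))::numeric / COUNT(*), 2) AS avg_casualties
  FROM shootings
  WHERE state IS NOT NULL
  GROUP BY state
  ORDER BY incidents DESC
  LIMIT 20;
")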

Temporal Trends (Past Decade)

# Annual statistics
annual_data <- dbGetQuery(con, "
  SELECT 
    EXTRACT(YEAR FROM incident_date)::INTEGER as year,
    COUNT(*)::INTEGER as incidents,
    SUM(victims_killed)::INTEGER as killed,
    SUM(victims_injured)::INTEGER as injured,
    (SUM(victims_killed) + SUM(victims_injured))::INTEGER as total_casualties
  FROM shootings
  WHERE incident_date >= CURRENT_DATE - INTERVAL '10 years'
    AND incident_date IS NOT NULL
  GROUP BY year
  ORDER BY year
")

kable(annual_data, 
      caption = "Annual Shooting Statistics (Past Decade)",
      format.args = list(big.mark = ","))
Annual Shooting Statistics (Past Decade)
year incidents killed injured total_casualties
2015 6,532 2,352 4,941 7,293
2016 36,685 12,869 28,652 41,521
2017 37,881 13,424 29,488 42,912
2018 34,602 12,435 26,155 38,590
2019 36,347 13,014 28,388 41,402
2020 46,697 17,098 37,522 54,620
2021 48,481 18,474 38,692 57,166
2022 46,517 17,756 36,703 54,459
2023 43,586 16,274 34,581 50,855
# Line chart - incidents
p1 <- ggplot(annual_data, aes(x = year, y = incidents)) +
  geom_line(color = 'darkred', linewidth = 1.5) +
  geom_point(size = 4, color = 'darkred') +
  geom_smooth(method = 'lm', se = TRUE, alpha = 0.2) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Annual Shooting Incidents Over Time",
       subtitle = "Past decade with linear trend",
       x = "Year", y = "Number of Incidents") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = 'bold'))
print(p1)
## `geom_smooth()` using formula = 'y ~ x'

# Statistical test
if (nrow(annual_data) >= 3) {
  trend_test <- cor.test(annual_data$year, annual_data$incidents)
  cat("\n**Trend Analysis:**\n")
  cat(sprintf("Correlation: r = %.3f, p = %.4f\n", 
              trend_test$estimate, trend_test$p.value))
  if (trend_test$p.value < 0.05) {
    cat(sprintf("✓ Statistically significant %s in incidents over time\n\n",
                ifelse(trend_test$estimate > 0, "INCREASE", "DECREASE")))
  }
}
## 
## **Trend Analysis:**
## Correlation: r = 0.758, p = 0.0181
## ✓ Statistically significant INCREASE in incidents over time
# Casualties stacked area chart
casualties_long <- annual_data %>%
  select(year, killed, injured) %>%
  pivot_longer(cols = c(killed, injured), 
               names_to = "casualty_type", 
               values_to = "count")

p2 <- ggplot(casualties_long, aes(x = year, y = count, fill = casualty_type)) +
  geom_area(alpha = 0.7) +
  scale_fill_manual(values = c("killed" = "darkred", "injured" = "orange"),
                    labels = c("killed" = "Deaths", "injured" = "Injuries")) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Annual Casualties from Shooting Incidents",
       x = "Year", y = "Number of People", fill = "Type") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = 'bold'))

print(p2)

Monthly Patterns

Severity Distribution
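
The code for these two sections is not echoed in this render; the monthly pattern, for example, comes from a query along these lines (a sketch using the shootings columns shown above):

# Sketch: incident counts by calendar month over the past decade
monthly_data <- dbGetQuery(con, "
  SELECT EXTRACT(MONTH FROM incident_date)::int AS month,
         COUNT(*)::int AS incidents
  FROM shootings
  WHERE incident_date >= CURRENT_DATE - INTERVAL '10 years'
    AND incident_date IS NOT NULL
  GROUP BY month
  ORDER BY month;
")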


Deadliest Incidents (≥10 casualties)

Most Severe Incidents (Past Decade)
Date State Killed Injured Total
2017-10-01 Nevada 60 439 499
2016-06-12 Florida 49 53 102
2019-08-03 Texas 23 23 46
2017-11-05 Texas 26 20 46
2022-05-24 NA 22 18 40
2022-05-24 Texas 21 17 38
2022-07-04 Illinois 7 30 37
2023-04-15 Alabama 4 32 36
2018-02-14 Florida 17 17 34
2018-02-14 NA 17 17 34
2015-12-02 California 14 19 33
2023-10-25 Maine 18 13 31
2019-08-31 Texas 7 23 30
2023-07-02 Maryland 2 27 29
2022-03-19 Arkansas 1 26 27

Key Findings

Summary Statistics

Total casualties (past decade): 388,796
  • Deaths: 123,696
  • Injuries: 265,122

Average casualties per incident: 1.15

Most affected state: Illinois (34,819 incidents)

Logical Model and ER Diagram

Our schema follows a normalized design (3NF) to minimize redundancy; a DDL sketch follows the entity list below.

Entities

  • Schools (school_id, name, city, state, school_type, enrollment)

  • Shootings (shooting_id, date, school_id, num_killed, num_injured, weapon_type, motive)

  • Suspects (suspect_id, name, age, gender, mental_health_flag)

  • Victims (victim_id, shooting_id, age, role, injury_status)

  • Weapons (weapon_id, type, legally_owned, owner_id)
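
A conceptual DDL sketch of a few of these tables, written as the loader would execute it (the authoritative definitions live in db/schema.sql; the sample queries elsewhere in this document refer to the same keys as incident_id and incident_date):

# Sketch: representative CREATE TABLE statements mirroring the entities above
DBI::dbExecute(con, "
  CREATE TABLE IF NOT EXISTS schools (
    school_id   SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT,
    state       TEXT,
    school_type TEXT,
    enrollment  INTEGER
  );")
DBI::dbExecute(con, "
  CREATE TABLE IF NOT EXISTS shootings (
    shooting_id SERIAL PRIMARY KEY,
    date        DATE,
    school_id   INTEGER REFERENCES schools(school_id),
    num_killed  INTEGER,
    num_injured INTEGER,
    weapon_type TEXT,
    motive      TEXT
  );")
DBI::dbExecute(con, "
  CREATE TABLE IF NOT EXISTS suspects (
    suspect_id         SERIAL PRIMARY KEY,
    name               TEXT,
    age                INTEGER,
    gender             TEXT,
    mental_health_flag BOOLEAN
  );")
# Linking table for the many-to-many relationship between shootings and suspects
DBI::dbExecute(con, "
  CREATE TABLE IF NOT EXISTS shooting_suspects (
    shooting_id INTEGER REFERENCES shootings(shooting_id),
    suspect_id  INTEGER REFERENCES suspects(suspect_id),
    PRIMARY KEY (shooting_id, suspect_id)
  );")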

ER Diagram

Normalization Notes

The schema is in Third Normal Form (3NF):

  • Each table has a primary key.

  • Non-key attributes depend only on the primary key.

  • Many-to-many relationships are handled via linking tables (shooting_suspects, suspect_weapons).

Closing Connection

dbDisconnect(con)

Conclusions

We would have liked to delve further into the analysis of suspects, but we were not able to find data sufficient for a meaningful analysis. Still, we gained some key insights into the demographics of school shootings themselves that can help us better understand the problem.

This exercise also asked us to consider a broader question: which data science skills are the most valued?

To be an effective data analyst, you need the right mix of technical ability, problem-solving skills, and communication. SQL is one of the most essential skills because it is the industry standard for accessing and managing data, and Excel is still widely used for quick analysis and smaller datasets. According to an article by Northeastern University (O’Connor, 2024), SQL and Excel are considered two of the top foundational tools every analyst should know. Programming skills in Python or R help with automation, large-scale data analysis, and statistical work. Data visualization tools like Tableau or Power BI are also important for telling clear and compelling stories with data. As the Northeastern article points out, communication and presentation are just as critical as technical analysis, since analysts need to make insights understandable for non-technical audiences. On the soft-skills side, critical thinking, attention to detail, and problem-solving help analysts interpret data correctly and avoid errors. Collaboration and domain knowledge, such as understanding the business context, make it easier to translate data into real decisions. Altogether, the strongest analysts combine technical tools, analytical thinking, and strong communication to deliver insights that matter.

On the personal side, we would add that perseverance, adaptability, and integrity are important as well. It is often hard to find the exact data you are looking for in a usable form, and getting different datasets to play nicely with each other takes a lot of trial and error, so you need perseverance to get through when the going gets tough. Sometimes your first choices are simply wrong, and being adaptable enough to change course can mean the difference between failure and success. Finally, a sense of integrity keeps you on the right path: none of us is free from bias, and that bias often drives the subjects we choose to analyze, so being honest during the analysis and putting in the necessary safeguards to compensate for your bias is critical to producing results that can be depended on.

Reference: O’Connor, S. W. (2024, July 9). 7 must-have skills for data analysts. Northeastern University Graduate Programs.