📚 Database Exam Study Guide

Mississippi Literacy Database System

📖 KEY TERMS & DEFINITIONS

Primary Key - Unique identifier for each table row (e.g., school_id) Foreign Key - Links to primary key of another table (e.g., district_id) Entity - Object about which data is stored (Schools, Districts) ERD - Visual diagram showing entities and relationships CRUD - Create, Read, Update, Delete operations cardinality - defines the number of instances of one entity that can be related to instances of another entity, typically categorized as one-to-one (1:1), one-to-many (1:N), or many-to-many (M:N)

1NF - Atomic values, no repeating groups 2NF - 1NF + no partial dependencies
3NF - 2NF + no transitive dependencies Denormalization - Trading normalization for performance

INNER JOIN - Only matching records from both tables LEFT JOIN - All left records + matching right records Aggregate Functions - COUNT(), SUM(), AVG(), MIN(), MAX() GROUP BY - Groups rows for aggregation Subquery - Query inside another query

Index - Improves query performance ETL - Extract, Transform, Load data process ORM - Maps objects to database tables (SQLAlchemy) API - Interface for accessing data (/api/districts)

🗄️ 1. DATABASE SCHEMA

Main Entities (8 tables)

Locations - Geographic data (county, city, ZIP)
Districts - 150 MS school districts
Schools - 1,003 individual schools
DemographicGroups - Student subgroup classifications
AcademicYears - Time periods for tracking
PerformanceRecords - 18,630 literacy performance metrics
TeacherQuality - Teacher experience metrics
Books - 259 book recommendations

Entity Relationships (1:M)

Locations (1) → Districts (Many) - One county/city can have multiple school districts - FK: Districts.location_id → Locations.location_id - Example: Harrison County has Biloxi, Gulfport, Long Beach districts

Districts (1) → Schools (Many)
- One district manages multiple schools - FK: Schools.district_id → Districts.district_id - Example: Jackson Public has 50+ individual schools

Schools (1) → PerformanceRecords (Many) - One school has records for multiple years/demographics - FK: PerformanceRecords.school_id → Schools.school_id - Example: One school has records for each grade, year, subgroup

DemographicGroups (1) → PerformanceRecords (Many) - One demographic category (e.g., “All Students”) applies to many records - FK: PerformanceRecords.group_id → DemographicGroups.group_id - Example: “All Students” subgroup appears in thousands of records

AcademicYears (1) → PerformanceRecords (Many) - One academic year contains many performance measurements - FK: PerformanceRecords.year_id → AcademicYears.year_id - Example: 2023 school year has 18,630 performance records

Districts (1) → TeacherQuality (Many) - One district has teacher quality data for multiple years - FK: TeacherQuality.district_id → Districts.district_id

Primary Keys: All tables use surrogate keys (table_id) for stability and performance

Foreign Keys List

Districts.location_id → Locations.location_id
Schools.district_id → Districts.district_id
PerformanceRecords.school_id → Schools.school_id
PerformanceRecords.group_id → DemographicGroups.group_id
PerformanceRecords.year_id → AcademicYears.year_id
TeacherQuality.district_id → Districts.district_id
TeacherQuality.year_id → AcademicYears.year_id

ERD Model Overview

┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────────┐
│  Locations  │───▷│   Districts  │───▷│   Schools   │───▷│ PerformanceRecords│
│ location_id │ 1:M│ district_id  │ 1:M│ school_id   │ 1:M│   record_id      │
│ county      │    │ district_name│    │ school_name │    │ english_proficiency│
│ city        │    │ location_id  │    │ district_id │    │ school_id        │
│ zip_code    │    └──────────────┘    │ school_type │    │ group_id         │
└─────────────┘                       │ grade_span  │    │ year_id          │
                                      └─────────────┘    └──────────────────┘
                                                                    ▲     ▲
                                                                    │     │
┌─────────────┐                       ┌──────────────┐             │     │
│TeacherQuality│                      │DemographicGroups│ ──────────┘     │
│ quality_id  │                       │  group_id    │ 1:M               │
│ district_id │◁──────────────────────│subgroup_name │                   │
│ year_id     │                       │subgroup_type │                   │
└─────────────┘                       └──────────────┘                   │
      │                                                                  │
      │                               ┌──────────────┐                   │
      └───────────────────────────────│AcademicYears │ ──────────────────┘
                                      │  year_id     │ 1:M
                                      │ school_year  │
                                      └──────────────┘

                    ┌─────────────┐
                    │   Books     │  (Standalone - No FKs)
                    │  book_id    │
                    │   title     │
                    │  author     │
                    │ grade_level │
                    └─────────────┘

📐 2. NORMALIZATION IN DEPTH

1NF (First Normal Form) Analysis

Rule: Each column contains atomic values, no repeating groups

✅ PerformanceRecords Table Complies: - english_proficiency = single percentage value (not “45%, 67%”) - Each performance level has separate columns: performance_level_1_pct, performance_level_2_pct - No arrays or comma-separated values

❌ What Would Violate 1NF:

-- BAD: Multiple values in one column
performance_levels VARCHAR(100) = "Level1:45%,Level2:30%,Level3:25%"

-- BAD: Repeating groups
school_name_1, school_name_2, school_name_3

2NF (Second Normal Form) Analysis

Rule: In 1NF + no partial dependencies on composite keys

✅ All Tables Comply: - Primary keys are single columns (school_id, district_id) - No composite primary keys = no partial dependencies possible - All attributes fully depend on the entire primary key

❌ Example 2NF Violation:

-- BAD: Composite key with partial dependency
CREATE TABLE SchoolEnrollment (
    school_id INT,
    student_id INT,
    district_name VARCHAR(100),  -- Only depends on school_id!
    enrollment_date DATE,
    PRIMARY KEY (school_id, student_id)
);

Problem: district_name only depends on school_id, not the full key

3NF (Third Normal Form) Analysis

Rule: In 2NF + no transitive dependencies

✅ Schools Table Example:

Schools: school_id → school_name, school_type, grade_span, district_id
Districts: district_id → district_name, location_id

school_name depends directly on school_id ✅
District info stored in separate table ✅
No transitive dependency: school_id → district_id → district_name ❌

❌ What Would Violate 3NF:

-- BAD: Storing district info in Schools table
CREATE TABLE Schools (
    school_id INT PRIMARY KEY,
    school_name VARCHAR(255),
    district_id INT,
    district_name VARCHAR(255),  -- Transitive dependency!
    county VARCHAR(100)          -- Transitive dependency!
);

Problem: school_id → district_id → district_name (transitive)

Controlled Denormalization Decision

PerformanceRecords stores both percentage AND count for each performance level

Analysis: - Technically denormalized: Could calculate percentage from count/total - Justified because: - Source data provides both values - Different analytical needs (percentage for comparison, count for totals) - Avoids expensive calculations during queries - Values are atomic and don’t create update anomalies

Alternative (Fully Normalized):

-- Could split into separate table
CREATE TABLE PerformanceLevelDetails (
    record_id INT,
    level_number INT,
    percentage FLOAT,
    count INT,
    PRIMARY KEY (record_id, level_number)
);

Trade-off: More normalized but requires JOINs for every query

🔧 3. SQL vs ORM EXAMPLES

Simple Query Examples

Example 1: Get All Schools in a District

Raw SQL:

SELECT school_name, school_type, grade_span
FROM Schools 
WHERE district_id = 5;

SQLAlchemy ORM:

schools = Schools.query.filter_by(district_id=5).all()
for school in schools:
    print(school.school_name, school.school_type)

Example 2: Get District with Location Info

Raw SQL:

SELECT d.district_name, l.county, l.city
FROM Districts d
JOIN Locations l ON d.location_id = l.location_id
WHERE d.district_id = 10;

SQLAlchemy ORM:

district = Districts.query.filter_by(district_id=10).first()
print(f"{district.district_name} in {district.location.county}")

Example 3: Count Schools per District

Raw SQL:

SELECT d.district_name, COUNT(s.school_id) as school_count
FROM Districts d
LEFT JOIN Schools s ON d.district_id = s.district_id
GROUP BY d.district_id, d.district_name
ORDER BY school_count DESC;

SQLAlchemy ORM:

from sqlalchemy import func

district_counts = db.session.query(
    Districts.district_name,
    func.count(Schools.school_id).label('school_count')
).outerjoin(Schools).group_by(Districts.district_id)\
 .order_by(func.count(Schools.school_id).desc()).all()

SQL vs ORM Trade-offs Analysis

Raw SQL Advantages: - Performance Control: Direct query optimization, no ORM overhead - Full SQL Features: Window functions, CTEs, complex subqueries - Debugging: See exact SQL executed, easier to optimize - Database-Specific: Can use MySQL-specific optimizations

Raw SQL Disadvantages: - Database Coupling: SQL tied to specific database system
- Security Risk: Manual string building can lead to SQL injection - Maintenance: Schema changes require updating multiple SQL files - Type Safety: No compile-time checking of column names/types

ORM (SQLAlchemy) Advantages: - Database Agnostic: Same code works with MySQL, PostgreSQL, SQLite - Type Safety: Python objects with IDE autocomplete and error checking - Relationship Handling: Automatic joins through defined relationships - Migration Support: Schema changes tracked and versioned - Security: Built-in protection against SQL injection

ORM Disadvantages: - Performance Overhead: Additional layer between code and database - Learning Curve: Must understand both Python ORM and underlying SQL - Limited Control: Complex queries may require dropping to raw SQL - Query Complexity: ORM queries can become harder to read than SQL

Best Practice in MS Literacy Database:

CRUD Operations → ORM:

# Simple operations use ORM for safety and maintainability
new_school = Schools(
    school_number=123,
    school_name="New Elementary",
    district_id=district.district_id,
    school_type="Public"
)
db.session.add(new_school)
db.session.commit()

Analytics Queries → Raw SQL:

# Complex analytics use raw SQL for performance
analytics_query = """
    SELECT county, AVG(english_proficiency) as avg_performance
    FROM districts d
    JOIN locations l ON d.location_id = l.location_id
    JOIN schools s ON d.district_id = s.district_id
    JOIN performance_records pr ON s.school_id = pr.school_id
    WHERE pr.group_id = :group_id
    GROUP BY county
    ORDER BY avg_performance DESC
"""
results = db.session.execute(text(analytics_query), {'group_id': 1})

Data Integrity Constraints

-- Foreign key constraints ensure referential integrity
ALTER TABLE PerformanceRecords 
ADD CONSTRAINT fk_performance_school 
FOREIGN KEY (school_id) REFERENCES Schools(school_id) ON DELETE CASCADE;

-- Check constraints validate business rules
ALTER TABLE PerformanceRecords 
ADD CONSTRAINT chk_proficiency_range 
CHECK (english_proficiency >= 0 AND english_proficiency <= 100);

-- NOT NULL constraints ensure required data
ALTER TABLE PerformanceRecords 
MODIFY COLUMN school_id INT NOT NULL;

🏗️ 4. DESIGN DECISIONS

Surrogate Keys Used

Why: Stability, performance, uniqueness guarantee Example: school_id instead of school name (names can change)

ETL Process

Extract: CSV data import
Transform: Handle inconsistent formats (“<10”, “0-10%”)
Load: Batch processing with error handling

🖥️ 5. APPLICATION STACK

Technology Used

Backend: Flask + SQLAlchemy ORM + MySQL
Frontend: JavaScript + Chart.js + Leaflet maps
Deployment: Docker Compose

API Design

REST endpoints: /api/districts, /api/performance
Filtering: Query parameters for data filtering
Response format: JSON with success, data, count

🚀 6. PERFORMANCE

Indexing Strategy

-- Key indexes for performance
CREATE INDEX idx_performance_school ON PerformanceRecords(school_id);
CREATE INDEX idx_schools_district ON Schools(district_id);
CREATE INDEX idx_districts_location ON Districts(location_id);

Query Optimization

Use specific column names, not SELECT *
Filter early with WHERE clauses
Use LIMIT for large result sets

📈 7. EXAM PRACTICE QUESTIONS

Entity Relationships (15 points)

Q1: Explain why the relationship between Schools and PerformanceRecords is 1:Many rather than 1:1. Give a specific example.

Normalization Analysis (20 points)

Q2: Analyze this proposed table design. What normal form violations exist?

CREATE TABLE SchoolSummary (
    school_id INT,
    school_name VARCHAR(255),
    district_name VARCHAR(255),     -- Violation?
    county VARCHAR(100),            -- Violation?
    math_score_2022 FLOAT,
    math_score_2023 FLOAT,         -- Violation?
    english_score_2022 FLOAT,
    english_score_2023 FLOAT      -- Violation?
);

Q4: The PerformanceRecords table stores both percentage and count for each level. Is this normalized? Justify your answer with pros/cons.

SQL vs ORM (25 points)

Q3: Convert this SQL query to SQLAlchemy ORM syntax:

SELECT school_name, school_type 
FROM Schools 
WHERE district_id = 15 AND school_type = 'Elementary';

Q4: Write the SQL equivalent of this ORM query:

books = Books.query.filter_by(grade_level='3rd Grade').order_by(Books.title).all()

Good luck with your database exam! 🍀

Focused study guide covering core database concepts using the Mississippi Literacy Database.