πŸ“š Database Exam Study Guide

Mississippi Literacy Database System


πŸ“– KEY TERMS & DEFINITIONS

Primary Key - Unique identifier for each table row (e.g., school_id) Foreign Key - Links to primary key of another table (e.g., district_id) Entity - Object about which data is stored (Schools, Districts) ERD - Visual diagram showing entities and relationships CRUD - Create, Read, Update, Delete operations cardinality - defines the number of instances of one entity that can be related to instances of another entity, typically categorized as one-to-one (1:1), one-to-many (1:N), or many-to-many (M:N)

1NF - Atomic values, no repeating groups 2NF - 1NF + no partial dependencies
3NF - 2NF + no transitive dependencies Denormalization - Trading normalization for performance

INNER JOIN - Only matching records from both tables LEFT JOIN - All left records + matching right records Aggregate Functions - COUNT(), SUM(), AVG(), MIN(), MAX() GROUP BY - Groups rows for aggregation Subquery - Query inside another query

Index - Improves query performance ETL - Extract, Transform, Load data process ORM - Maps objects to database tables (SQLAlchemy) API - Interface for accessing data (/api/districts)


πŸ—„οΈ 1. DATABASE SCHEMA

Main Entities (8 tables)

  1. Locations - Geographic data (county, city, ZIP)
  2. Districts - 150 MS school districts
  3. Schools - 1,003 individual schools
  4. DemographicGroups - Student subgroup classifications
  5. AcademicYears - Time periods for tracking
  6. PerformanceRecords - 18,630 literacy performance metrics
  7. TeacherQuality - Teacher experience metrics
  8. Books - 259 book recommendations

Entity Relationships (1:M)

Locations (1) β†’ Districts (Many) - One county/city can have multiple school districts - FK: Districts.location_id β†’ Locations.location_id - Example: Harrison County has Biloxi, Gulfport, Long Beach districts

Districts (1) β†’ Schools (Many)
- One district manages multiple schools - FK: Schools.district_id β†’ Districts.district_id - Example: Jackson Public has 50+ individual schools

Schools (1) β†’ PerformanceRecords (Many) - One school has records for multiple years/demographics - FK: PerformanceRecords.school_id β†’ Schools.school_id - Example: One school has records for each grade, year, subgroup

DemographicGroups (1) β†’ PerformanceRecords (Many) - One demographic category (e.g., β€œAll Students”) applies to many records - FK: PerformanceRecords.group_id β†’ DemographicGroups.group_id - Example: β€œAll Students” subgroup appears in thousands of records

AcademicYears (1) β†’ PerformanceRecords (Many) - One academic year contains many performance measurements - FK: PerformanceRecords.year_id β†’ AcademicYears.year_id - Example: 2023 school year has 18,630 performance records

Districts (1) β†’ TeacherQuality (Many) - One district has teacher quality data for multiple years - FK: TeacherQuality.district_id β†’ Districts.district_id

Primary Keys: All tables use surrogate keys (table_id) for stability and performance

Foreign Keys List

  • Districts.location_id β†’ Locations.location_id
  • Schools.district_id β†’ Districts.district_id
  • PerformanceRecords.school_id β†’ Schools.school_id
  • PerformanceRecords.group_id β†’ DemographicGroups.group_id
  • PerformanceRecords.year_id β†’ AcademicYears.year_id
  • TeacherQuality.district_id β†’ Districts.district_id
  • TeacherQuality.year_id β†’ AcademicYears.year_id

ERD Model Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Locations  │───▷│   Districts  │───▷│   Schools   │───▷│ PerformanceRecordsβ”‚
β”‚ location_id β”‚ 1:Mβ”‚ district_id  β”‚ 1:Mβ”‚ school_id   β”‚ 1:Mβ”‚   record_id      β”‚
β”‚ county      β”‚    β”‚ district_nameβ”‚    β”‚ school_name β”‚    β”‚ english_proficiencyβ”‚
β”‚ city        β”‚    β”‚ location_id  β”‚    β”‚ district_id β”‚    β”‚ school_id        β”‚
β”‚ zip_code    β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ school_type β”‚    β”‚ group_id         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚ grade_span  β”‚    β”‚ year_id          β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                    β–²     β–²
                                                                    β”‚     β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚     β”‚
β”‚TeacherQualityβ”‚                      β”‚DemographicGroupsβ”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚ quality_id  β”‚                       β”‚  group_id    β”‚ 1:M               β”‚
β”‚ district_id │◁──────────────────────│subgroup_name β”‚                   β”‚
β”‚ year_id     β”‚                       β”‚subgroup_type β”‚                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
      β”‚                                                                  β”‚
      β”‚                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
      └───────────────────────────────│AcademicYears β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                      β”‚  year_id     β”‚ 1:M
                                      β”‚ school_year  β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Books     β”‚  (Standalone - No FKs)
                    β”‚  book_id    β”‚
                    β”‚   title     β”‚
                    β”‚  author     β”‚
                    β”‚ grade_level β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“ 2. NORMALIZATION IN DEPTH

1NF (First Normal Form) Analysis

Rule: Each column contains atomic values, no repeating groups

βœ… PerformanceRecords Table Complies: - english_proficiency = single percentage value (not β€œ45%, 67%”) - Each performance level has separate columns: performance_level_1_pct, performance_level_2_pct - No arrays or comma-separated values

❌ What Would Violate 1NF:

-- BAD: Multiple values in one column
performance_levels VARCHAR(100) = "Level1:45%,Level2:30%,Level3:25%"

-- BAD: Repeating groups
school_name_1, school_name_2, school_name_3

2NF (Second Normal Form) Analysis

Rule: In 1NF + no partial dependencies on composite keys

βœ… All Tables Comply: - Primary keys are single columns (school_id, district_id) - No composite primary keys = no partial dependencies possible - All attributes fully depend on the entire primary key

❌ Example 2NF Violation:

-- BAD: Composite key with partial dependency
CREATE TABLE SchoolEnrollment (
    school_id INT,
    student_id INT,
    district_name VARCHAR(100),  -- Only depends on school_id!
    enrollment_date DATE,
    PRIMARY KEY (school_id, student_id)
);

Problem: district_name only depends on school_id, not the full key

3NF (Third Normal Form) Analysis

Rule: In 2NF + no transitive dependencies

βœ… Schools Table Example:

Schools: school_id β†’ school_name, school_type, grade_span, district_id
Districts: district_id β†’ district_name, location_id
  • school_name depends directly on school_id βœ…
  • District info stored in separate table βœ…
  • No transitive dependency: school_id β†’ district_id β†’ district_name ❌

❌ What Would Violate 3NF:

-- BAD: Storing district info in Schools table
CREATE TABLE Schools (
    school_id INT PRIMARY KEY,
    school_name VARCHAR(255),
    district_id INT,
    district_name VARCHAR(255),  -- Transitive dependency!
    county VARCHAR(100)          -- Transitive dependency!
);

Problem: school_id β†’ district_id β†’ district_name (transitive)

Controlled Denormalization Decision

PerformanceRecords stores both percentage AND count for each performance level

Analysis: - Technically denormalized: Could calculate percentage from count/total - Justified because: - Source data provides both values - Different analytical needs (percentage for comparison, count for totals) - Avoids expensive calculations during queries - Values are atomic and don’t create update anomalies

Alternative (Fully Normalized):

-- Could split into separate table
CREATE TABLE PerformanceLevelDetails (
    record_id INT,
    level_number INT,
    percentage FLOAT,
    count INT,
    PRIMARY KEY (record_id, level_number)
);

Trade-off: More normalized but requires JOINs for every query


πŸ”§ 3. SQL vs ORM EXAMPLES

Simple Query Examples

Example 1: Get All Schools in a District

Raw SQL:

SELECT school_name, school_type, grade_span
FROM Schools 
WHERE district_id = 5;

SQLAlchemy ORM:

schools = Schools.query.filter_by(district_id=5).all()
for school in schools:
    print(school.school_name, school.school_type)

Example 2: Get District with Location Info

Raw SQL:

SELECT d.district_name, l.county, l.city
FROM Districts d
JOIN Locations l ON d.location_id = l.location_id
WHERE d.district_id = 10;

SQLAlchemy ORM:

district = Districts.query.filter_by(district_id=10).first()
print(f"{district.district_name} in {district.location.county}")

Example 3: Count Schools per District

Raw SQL:

SELECT d.district_name, COUNT(s.school_id) as school_count
FROM Districts d
LEFT JOIN Schools s ON d.district_id = s.district_id
GROUP BY d.district_id, d.district_name
ORDER BY school_count DESC;

SQLAlchemy ORM:

from sqlalchemy import func

district_counts = db.session.query(
    Districts.district_name,
    func.count(Schools.school_id).label('school_count')
).outerjoin(Schools).group_by(Districts.district_id)\
 .order_by(func.count(Schools.school_id).desc()).all()

SQL vs ORM Trade-offs Analysis

Raw SQL Advantages: - Performance Control: Direct query optimization, no ORM overhead - Full SQL Features: Window functions, CTEs, complex subqueries - Debugging: See exact SQL executed, easier to optimize - Database-Specific: Can use MySQL-specific optimizations

Raw SQL Disadvantages: - Database Coupling: SQL tied to specific database system
- Security Risk: Manual string building can lead to SQL injection - Maintenance: Schema changes require updating multiple SQL files - Type Safety: No compile-time checking of column names/types

ORM (SQLAlchemy) Advantages: - Database Agnostic: Same code works with MySQL, PostgreSQL, SQLite - Type Safety: Python objects with IDE autocomplete and error checking - Relationship Handling: Automatic joins through defined relationships - Migration Support: Schema changes tracked and versioned - Security: Built-in protection against SQL injection

ORM Disadvantages: - Performance Overhead: Additional layer between code and database - Learning Curve: Must understand both Python ORM and underlying SQL - Limited Control: Complex queries may require dropping to raw SQL - Query Complexity: ORM queries can become harder to read than SQL

Best Practice in MS Literacy Database:

CRUD Operations β†’ ORM:

# Simple operations use ORM for safety and maintainability
new_school = Schools(
    school_number=123,
    school_name="New Elementary",
    district_id=district.district_id,
    school_type="Public"
)
db.session.add(new_school)
db.session.commit()

Analytics Queries β†’ Raw SQL:

# Complex analytics use raw SQL for performance
analytics_query = """
    SELECT county, AVG(english_proficiency) as avg_performance
    FROM districts d
    JOIN locations l ON d.location_id = l.location_id
    JOIN schools s ON d.district_id = s.district_id
    JOIN performance_records pr ON s.school_id = pr.school_id
    WHERE pr.group_id = :group_id
    GROUP BY county
    ORDER BY avg_performance DESC
"""
results = db.session.execute(text(analytics_query), {'group_id': 1})

Data Integrity Constraints

-- Foreign key constraints ensure referential integrity
ALTER TABLE PerformanceRecords 
ADD CONSTRAINT fk_performance_school 
FOREIGN KEY (school_id) REFERENCES Schools(school_id) ON DELETE CASCADE;

-- Check constraints validate business rules
ALTER TABLE PerformanceRecords 
ADD CONSTRAINT chk_proficiency_range 
CHECK (english_proficiency >= 0 AND english_proficiency <= 100);

-- NOT NULL constraints ensure required data
ALTER TABLE PerformanceRecords 
MODIFY COLUMN school_id INT NOT NULL;

πŸ—οΈ 4. DESIGN DECISIONS

Surrogate Keys Used

Why: Stability, performance, uniqueness guarantee Example: school_id instead of school name (names can change)

ETL Process

  • Extract: CSV data import
  • Transform: Handle inconsistent formats (β€œ<10”, β€œ0-10%”)
  • Load: Batch processing with error handling

πŸ–₯️ 5. APPLICATION STACK

Technology Used

  • Backend: Flask + SQLAlchemy ORM + MySQL
  • Frontend: JavaScript + Chart.js + Leaflet maps
  • Deployment: Docker Compose

API Design

  • REST endpoints: /api/districts, /api/performance
  • Filtering: Query parameters for data filtering
  • Response format: JSON with success, data, count

πŸš€ 6. PERFORMANCE

Indexing Strategy

-- Key indexes for performance
CREATE INDEX idx_performance_school ON PerformanceRecords(school_id);
CREATE INDEX idx_schools_district ON Schools(district_id);
CREATE INDEX idx_districts_location ON Districts(location_id);

Query Optimization

  • Use specific column names, not SELECT *
  • Filter early with WHERE clauses
  • Use LIMIT for large result sets

πŸ“ˆ 7. EXAM PRACTICE QUESTIONS

Entity Relationships (15 points)

Q1: Explain why the relationship between Schools and PerformanceRecords is 1:Many rather than 1:1. Give a specific example.

Normalization Analysis (20 points)

Q2: Analyze this proposed table design. What normal form violations exist?

CREATE TABLE SchoolSummary (
    school_id INT,
    school_name VARCHAR(255),
    district_name VARCHAR(255),     -- Violation?
    county VARCHAR(100),            -- Violation?
    math_score_2022 FLOAT,
    math_score_2023 FLOAT,         -- Violation?
    english_score_2022 FLOAT,
    english_score_2023 FLOAT      -- Violation?
);

Q4: The PerformanceRecords table stores both percentage and count for each level. Is this normalized? Justify your answer with pros/cons.

SQL vs ORM (25 points)

Q3: Convert this SQL query to SQLAlchemy ORM syntax:

SELECT school_name, school_type 
FROM Schools 
WHERE district_id = 15 AND school_type = 'Elementary';

Q4: Write the SQL equivalent of this ORM query:

books = Books.query.filter_by(grade_level='3rd Grade').order_by(Books.title).all()

Good luck with your database exam! πŸ€

Focused study guide covering core database concepts using the Mississippi Literacy Database.