This script loads and validates the structure of a reviewer dataset stored in an Excel file, ensuring column integrity and identifying nulls in key fields.
import pandas as pd
# Load the Excel file
file_path = '/Users/cynthiamcginnis/Downloads/Reviewers_Schema_Prototype.xlsx'
df = pd.read_excel(file_path)
# Display first few rows
print("Preview of data:")
print(df.head())
# Check column names for structure validation
expected_columns = [
"employee_id", "first_name", "last_name", "email", "department", "role",
"expertise_areas", "experience_level", "max_reviews_per_month", "active",
"last_review_date", "total_reviews_completed", "current_workload",
"performance_rating", "notes"
]
print("\nValidating column structure...")
missing_columns = [col for col in expected_columns if col not in df.columns]
extra_columns = [col for col in df.columns if col not in expected_columns]
if not missing_columns and not extra_columns:
print(" 👍🏻 Column structure is valid.")
else:
print(" 👎 Column validation issues detected:")
if missing_columns:
print(f" - Missing columns: {missing_columns}")
if extra_columns:
print(f" - Unexpected columns: {extra_columns}")
# Optional: check for null values in critical columns
critical_columns = ["employee_id", "email", "active"]
print("\nChecking for null values in critical fields...")
for col in critical_columns:
null_count = df[col].isnull().sum()
print(f" - {col}: {null_count} null values")
# Example of business logic validation
print("\nChecking for logical constraints...")
invalid_workloads = df[df["current_workload"] > df["max_reviews_per_month"]]
if not invalid_workloads.empty:
print(f"⚠️ Found {len(invalid_workloads)} rows with workload exceeding max allowed:")
print(invalid_workloads[["employee_id", "current_workload", "max_reviews_per_month"]])
else:
print(" 🫶 All workload constraints satisfied.")
Preview of data:
employee_id first_name last_name email department \
0 EMP001 Breanna King vcox@hotmail.com QA
1 EMP002 Emily Conway zimmermanalexis@hotmail.com Data Science
2 EMP003 Jacob Sullivan jatkins@yahoo.com Engineering
3 EMP004 Christina Yates carrierobinson@salazar.com Engineering
4 EMP005 Jessica Leblanc keith68@walker-young.info Data Science
role expertise_areas experience_level \
0 QA Analyst get, sense, office 4
1 Senior Developer economy, sort, last 4
2 QA Analyst stay, far, strategy 2
3 Senior Developer yard, why, tell 4
4 Junior Developer operation, until, table 5
max_reviews_per_month active last_review_date total_reviews_completed \
0 2 True 2025-02-27 37
1 2 False 2025-03-24 12
2 3 True 2025-02-24 41
3 5 False 2025-02-24 28
4 3 False 2025-04-08 42
current_workload performance_rating \
0 0 2.5
1 1 3.8
2 0 1.6
3 3 3.8
4 4 3.0
notes
0 Voice speak service themselves finally.
1 Guess left various institution.
2 Statement help anything would.
3 Throughout determine guess president manager.
4 Sense indicate their power.
Validating column structure...
👍🏻 Column structure is valid.
Checking for null values in critical fields...
- employee_id: 0 null values
- email: 0 null values
- active: 0 null values
Checking for logical constraints...
⚠️ Found 6 rows with workload exceeding max allowed:
employee_id current_workload max_reviews_per_month
4 EMP005 4 3
6 EMP007 5 4
10 EMP011 3 2
19 EMP020 3 2
26 EMP027 3 2
29 EMP030 4 3