This practicum builds custom Python functions from scratch to perform end-to-end data science on the Student Performance Dataset. All functions are hand-coded without relying on pre-built library shortcuts, covering:
Dataset berisi data akademik dan sosial dari 20 siswa: nilai matematika, membaca, menulis, kehadiran, jam belajar, dan pendidikan orang tua. Dibuat realistis dengan missing values dan outlier yang perlu ditangani.
# ============================================================
# DATASET: Student Performance (20 students, 10 columns)
# id, name, gender, age, math, reading, writing,
# attendance, study_hours, parent_edu
# ============================================================
students = [
[1, "Andi", "M", 16, 78, 82, 80, 90, 3.5, "Bachelor"],
[2, "Budi", "M", 17, 55, 60, 58, 75, 2.0, "High School"],
[3, "Citra", "F", 16, 88, 91, 89, 95, 4.5, "Master"],
[4, "Dian", "F", 17, None, 70, 68, 80, 2.5, "Bachelor"], # missing math
[5, "Eko", "M", 16, 45, None, 50, 65, 1.5, "High School"], # missing reading
[6, "Fitri", "F", 18, 92, 95, 93, 98, 5.0, "Master"],
[7, "Gilang", "M", 17, 63, 68, None, 72, 2.0, "High School"], # missing writing
[8, "Hana", "F", 16, 75, 79, 77, 88, 3.0, "Bachelor"],
[9, "Irfan", "M", 18, 10, 15, 12, 40, 0.5, "High School"], # outlier low
[10, "Julia", "F", 17, 82, 85, 83, 92, 4.0, "Bachelor"],
[11, "Kevin", "M", 16, 70, 73, 71, 85, 3.0, "High School"],
[12, "Lina", "F", 17, 95, 97, 96, 99, 5.5, "Master"],
[13, "Mario", "M", 18, 58, 62, 60, 70, 2.0, "High School"],
[14, "Nadia", "F", 16, 80, 84, 82, 91, 3.5, "Bachelor"],
[15, "Oscar", "M", 17, None, None, None, None,None,"High School"], # many missing
[16, "Putri", "F", 16, 85, 88, 86, 94, 4.0, "Master"],
[17, "Rizky", "M", 18, 67, 70, 68, 78, 2.5, "High School"],
[18, "Sari", "F", 17, 73, 76, 74, 87, 3.0, "Bachelor"],
[19, "Tono", "M", 16, 150, 80, 78, 88, 3.5, "Bachelor"], # outlier math=150
[20, "Una", "F", 18, 60, 65, 63, 74, 2.0, "High School"],
]
columns = ["id","name","gender","age","math","reading",
"writing","attendance","study_hours","parent_edu"]
print("Dataset loaded:", len(students), "students,", len(columns), "variables")| Variable | Type | Description |
|---|---|---|
id | Integer | Unique student identifier |
name | String | Full name of the student |
gender | Categorical | M = Male, F = Female |
age | Integer | Student age in years (16-18) |
math | Float | Mathematics score (0-100) |
reading | Float | Reading score (0-100) |
writing | Float | Writing score (0-100) |
attendance | Float | Attendance percentage (0-100%) |
study_hours | Float | Average daily study hours |
parent_edu | Categorical | Parent education: High School / Bachelor / Master |
Data cleaning mengidentifikasi dan menangani data yang tidak valid, tidak lengkap, atau tidak konsisten. Tanpa ini, hasil analisis bisa bias atau menyesatkan. Tiga masalah utama:
Missing values pada kolom numerik diisi menggunakan nilai rata-rata (mean) dari semua nilai valid di kolom tersebut.
# ============================================================
# FUNCTION 1: Deteksi Missing Values
# ============================================================
def detect_missing(data, columns):
"""
Mendeteksi jumlah dan persentase missing values per kolom.
Args: data=list of lists, columns=list of str
Returns: dict berisi count, percent, indices per kolom
"""
result = {}
n_rows = len(data)
for col_idx, col_name in enumerate(columns):
missing_indices = []
for row_idx, row in enumerate(data):
if row[col_idx] is None:
missing_indices.append(row_idx)
count = len(missing_indices)
percent = round((count / n_rows) * 100, 2)
result[col_name] = {"count": count, "percent": percent, "indices": missing_indices}
return result
# ============================================================
# FUNCTION 2: Imputasi Mean
# ============================================================
def impute_mean(data, columns, target_cols):
"""
Mengisi missing values dengan nilai mean kolom tersebut.
Args: data, columns, target_cols=kolom yang diimputasi
Returns: dataset baru setelah imputasi (deep copy)
"""
import copy
cleaned = copy.deepcopy(data)
for col_name in target_cols:
col_idx = columns.index(col_name)
valid_vals = [row[col_idx] for row in cleaned if row[col_idx] is not None]
mean_val = round(sum(valid_vals) / len(valid_vals), 2)
for row in cleaned:
if row[col_idx] is None:
row[col_idx] = mean_val
return cleaned
# --- Penggunaan ---
numeric_cols = ["math","reading","writing","attendance","study_hours"]
missing_report = detect_missing(students, columns)
print("=== MISSING VALUES REPORT ===")
for col, info in missing_report.items():
if info["count"] > 0:
print(f" {col:15s}: {info['count']} missing ({info['percent']}%) -> rows {info['indices']}")
students_clean = impute_mean(students, columns, numeric_cols)
print("\nImputasi selesai.")Outlier adalah nilai yang sangat jauh dari distribusi umum. IQR menggunakan kuartil untuk menentukan batas wajar suatu nilai.
def get_col_vals(data, columns, col_name):
"""Ambil semua nilai valid (non-None) dari satu kolom."""
idx = columns.index(col_name)
return [row[idx] for row in data if row[idx] is not None]
def compute_quartiles(values):
"""Hitung Q1 dan Q3 via interpolasi linear manual."""
sv = sorted(values)
n = len(sv)
def percentile(p):
pos = (p / 100) * (n - 1)
lo = int(pos); hi = lo + 1
if hi >= n: return sv[lo]
return sv[lo] + (pos - lo) * (sv[hi] - sv[lo])
return percentile(25), percentile(75)
def detect_outliers_iqr(data, columns, col_name):
"""Deteksi outlier dengan metode IQR."""
col_idx = columns.index(col_name)
values = get_col_vals(data, columns, col_name)
q1, q3 = compute_quartiles(values)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
out_rows = []
for i, row in enumerate(data):
val = row[col_idx]
if val is not None and (val < lower or val > upper):
out_rows.append((i, row[1], val))
return {"col": col_name, "Q1": round(q1,2), "Q3": round(q3,2),
"IQR": round(iqr,2), "lower": round(lower,2),
"upper": round(upper,2), "outliers": out_rows}
def remove_outliers(data, columns, col_name):
"""Hapus baris yang mengandung outlier pada kolom tertentu."""
info = detect_outliers_iqr(data, columns, col_name)
bad = {r[0] for r in info["outliers"]}
return [row for i, row in enumerate(data) if i not in bad]
# --- Penggunaan ---
for col in ["math","reading","writing"]:
info = detect_outliers_iqr(students_clean, columns, col)
print(f"[{col.upper():8s}] IQR={info['IQR']:5.1f} | Fence:[{info['lower']},{info['upper']}] | Outliers:{info['outliers']}")
students_final = remove_outliers(students_clean, columns, "math")
print(f"Baris setelah remove outlier: {len(students_final)} (dari {len(students_clean)})")Transformasi mengubah skala, format, atau struktur variabel agar lebih sesuai untuk analisis. Tiga teknik utama: Min-Max Normalization, Label Encoding, dan Feature Engineering.
def minmax_normalize(data, columns, col_name):
"""Normalisasi nilai kolom numerik ke rentang [0, 1]."""
col_idx = columns.index(col_name)
values = [row[col_idx] for row in data if row[col_idx] is not None]
x_min = min(values); x_max = max(values)
normalized = []
for row in data:
val = row[col_idx]
if val is None: normalized.append(None)
elif x_max == x_min: normalized.append(0.0)
else: normalized.append(round((val - x_min)/(x_max - x_min), 4))
return normalized, x_min, x_max
# --- Penggunaan ---
norm_math, mn, mx = minmax_normalize(students_final, columns, "math")
print(f"{'Nama':<10} {'Math (asli)':>12} {'Math (norm)':>12}")
print("-" * 36)
for i, row in enumerate(students_final[:8]):
print(f" {row[1]:<10} {str(row[4]):>12} {str(norm_math[i]):>12}")
print(f"Min={mn}, Max={mx}")def label_encode(data, columns, col_name, custom_order=None):
"""Mengubah variabel kategoris menjadi representasi numerik."""
col_idx = columns.index(col_name)
if custom_order:
categories = custom_order
else:
seen = []
for row in data:
val = row[col_idx]
if val is not None and val not in seen: seen.append(val)
categories = sorted(seen)
mapping = {cat: i for i, cat in enumerate(categories)}
encoded = [mapping.get(row[col_idx], -1) for row in data]
return encoded, mapping
# --- Penggunaan ---
edu_order = ["High School","Bachelor","Master"]
enc_edu, edu_map = label_encode(students_final, columns, "parent_edu", edu_order)
print("Mapping parent_edu:", edu_map)
print(f"{'Nama':<10} {'Parent Edu':<15} {'Encoded':>8}")
for i, row in enumerate(students_final[:8]):
print(f" {row[1]:<10} {row[9]:<15} {enc_edu[i]:>8}")
enc_gender, gender_map = label_encode(students_final, columns, "gender")
print("Mapping gender:", gender_map)parent_edu karena ada hierarki bermakna: High School < Bachelor < Master. Gender menggunakan binary encoding (F=0, M=1) — standar untuk variabel dikotomis.def create_avg_score(data, columns):
"""Feature baru: rata-rata skor dari math, reading, writing."""
mi = columns.index("math"); ri = columns.index("reading"); wi = columns.index("writing")
avg_scores = []
for row in data:
scores = [row[mi], row[ri], row[wi]]
valid = [s for s in scores if s is not None]
avg = round(sum(valid)/len(valid), 2) if valid else None
avg_scores.append(avg)
return avg_scores
def assign_grade(avg_score):
"""Konversi avg_score ke grade huruf. Skala: A(90+) B(80+) C(70+) D(60+) E(<60)"""
if avg_score is None: return "N/A"
elif avg_score >= 90: return "A"
elif avg_score >= 80: return "B"
elif avg_score >= 70: return "C"
elif avg_score >= 60: return "D"
else: return "E"
avg_scores = create_avg_score(students_final, columns)
grades = [assign_grade(s) for s in avg_scores]
print(f"{'Nama':<10} {'Math':>6} {'Read':>6} {'Write':>7} {'Avg':>7} {'Grade':>6}")
print("-" * 50)
for i, row in enumerate(students_final):
print(f" {row[1]:<10} {str(row[4]):>6} {str(row[5]):>6} {str(row[6]):>7} {str(avg_scores[i]):>7} {grades[i]:>6}")Statistik deskriptif merangkum karakteristik utama dataset melalui ukuran tendensi sentral dan penyebaran. Ini adalah langkah pertama wajib sebelum analisis inferensial apapun.
def compute_mean(values):
"""Menghitung rata-rata aritmetika dari list nilai valid."""
valid = [v for v in values if v is not None]
if not valid: return None
return round(sum(valid) / len(valid), 4)
def compute_median(values):
"""Menghitung nilai tengah (median) tanpa library."""
valid = sorted([v for v in values if v is not None])
n = len(valid)
if n == 0: return None
mid = n // 2
if n % 2 == 1: return round(valid[mid], 4)
return round((valid[mid-1] + valid[mid]) / 2, 4)
def compute_mode(values):
"""Menghitung modus (nilai terbanyak) dari list nilai."""
valid = [v for v in values if v is not None]
if not valid: return None
freq = {}
for v in valid: freq[v] = freq.get(v, 0) + 1
max_f = max(freq.values())
return [k for k, f in freq.items() if f == max_f]
# --- Penggunaan ---
numeric_cols = ["math","reading","writing","attendance","study_hours"]
print(f"{'Kolom':<14} {'Mean':>8} {'Median':>8} {'Mode':>18}")
print("-" * 52)
for col in numeric_cols:
vals = get_col_vals(students_final, columns, col)
print(f" {col:<14} {str(compute_mean(vals)):>8} {str(compute_median(vals)):>8} {str(compute_mode(vals)):>18}")import math
def compute_variance(values, sample=True):
"""Menghitung variance. sample=True pakai (n-1), False pakai n."""
valid = [v for v in values if v is not None]
n = len(valid)
if n < 2: return None
mean_val = sum(valid) / n
sum_sq = sum((x - mean_val)**2 for x in valid)
denom = (n - 1) if sample else n
return round(sum_sq / denom, 4)
def compute_std(values, sample=True):
"""Standar deviasi = akar dari variance."""
var = compute_variance(values, sample)
if var is None: return None
return round(math.sqrt(var), 4)
# --- Penggunaan ---
print(f"{'Kolom':<14} {'Variance':>10} {'Std Dev':>10} {'CV (%)':>8}")
print("-" * 46)
for col in numeric_cols:
vals = get_col_vals(students_final, columns, col)
var = compute_variance(vals); std = compute_std(vals); mean = compute_mean(vals)
cv = round((std/mean)*100, 2) if mean else None
print(f" {col:<14} {str(var):>10} {str(std):>10} {str(cv):>8}")study_hours memiliki CV 33.49% — sangat tinggi, menunjukkan pola belajar yang sangat beragam. attendance memiliki CV terendah (9.97%) — kehadiran paling konsisten.def pearson_correlation(x_vals, y_vals):
"""Koefisien korelasi Pearson antara dua variabel."""
pairs = [(x,y) for x,y in zip(x_vals,y_vals) if x is not None and y is not None]
n = len(pairs)
if n < 2: return None
xl = [p[0] for p in pairs]; yl = [p[1] for p in pairs]
xm, ym = sum(xl)/n, sum(yl)/n
num = sum((x-xm)*(y-ym) for x,y in zip(xl,yl))
dx = sum((x-xm)**2 for x in xl); dy = sum((y-ym)**2 for y in yl)
denom = math.sqrt(dx * dy)
if denom == 0: return 0.0
return round(num / denom, 4)
# --- Correlation matrix ---
corr_cols = ["math","reading","writing","attendance","study_hours"]
print("=== PEARSON CORRELATION MATRIX ===")
header = f"{'':<13}" + "".join(f"{c[:7]:>9}" for c in corr_cols)
print(header); print("-" * 60)
for c1 in corr_cols:
row_str = f"{c1:<13}"
for c2 in corr_cols:
vx = [students_final[i][columns.index(c1)] for i in range(len(students_final))]
vy = [students_final[i][columns.index(c2)] for i in range(len(students_final))]
row_str += f" {str(pearson_correlation(vx,vy)):>7}"
print(row_str)Reading-Writing (r=0.994) dan Math-Reading (r=0.982): korelasi sangat kuat — kemampuan literasi dan numerasi berkembang beriringan.
Study_hours berkorelasi kuat dengan semua skor (r≈0.81-0.82) — membuktikan jam belajar adalah prediktor terkuat prestasi akademik.
Attendance berkorelasi sedang (r≈0.70) — penting namun bukan satu-satunya faktor.
def full_summary_report(data, columns, numeric_cols):
"""Laporan statistik lengkap per kolom numerik."""
report = {}
for col in numeric_cols:
vals = get_col_vals(data, columns, col)
q1, q3 = compute_quartiles(vals)
iqr = q3 - q1
report[col] = {
"n": len(vals), "mean": compute_mean(vals),
"median": compute_median(vals), "mode": compute_mode(vals),
"std": compute_std(vals), "var": compute_variance(vals),
"min": min(vals), "max": max(vals),
"range": round(max(vals)-min(vals),4),
"Q1": round(q1,4), "Q3": round(q3,4), "IQR": round(iqr,4)
}
return report
summary = full_summary_report(students_final, columns, numeric_cols)
for col, s in summary.items():
print(f"\n{'='*46}")
print(f" [+] {col.upper()} (n={s['n']})")
print(f" {'-'*44}")
print(f" Mean : {s['mean']} | Median : {s['median']}")
print(f" Std Dev : {s['std']} | Variance: {s['var']}")
print(f" Min : {s['min']} | Max : {s['max']} | Range: {s['range']}")
print(f" Q1 : {s['Q1']} | Q3 : {s['Q3']} | IQR : {s['IQR']}")Math: Range 50 poin (45-95) dengan IQR=22 menunjukkan heterogenitas signifikan. 50% siswa berada antara Q1=63 dan Q3=85.
Reading & Writing: Lebih tinggi dari math (mean 78.44 & 76.53 vs 74.41) — literasi siswa lebih merata dibanding numerasi.
Attendance: Q1=78% — bahkan siswa terendah pun masih hadir 78%. Range 34 menunjukkan ada sebagian kecil yang perlu perhatian.
Study Hours: IQR hanya 2 jam (Q1=2.0, Q3=4.0) — 50% siswa belajar 2-4 jam/hari. Distribusi sehat dan wajar.
| Function | Kategori | Deskripsi |
|---|---|---|
detect_missing() | Data Cleaning | Deteksi jumlah & posisi missing values per kolom |
impute_mean() | Data Cleaning | Imputasi missing values dengan nilai rata-rata |
compute_quartiles() | Data Cleaning | Hitung Q1 & Q3 via interpolasi linear manual |
detect_outliers_iqr() | Data Cleaning | Deteksi outlier menggunakan metode IQR |
remove_outliers() | Data Cleaning | Hapus baris outlier dari dataset |
minmax_normalize() | Transformation | Normalisasi nilai ke rentang [0, 1] |
label_encode() | Transformation | Encoding variabel kategoris ke integer |
create_avg_score() | Feature Engineering | Buat fitur rata-rata skor akademik baru |
assign_grade() | Feature Engineering | Konversi skor ke grade huruf A-E |
compute_mean() | Statistical Analysis | Hitung rata-rata aritmetika |
compute_median() | Statistical Analysis | Hitung nilai tengah distribusi |
compute_mode() | Statistical Analysis | Hitung modus (nilai terbanyak) |
compute_variance() | Statistical Analysis | Hitung variance sampel/populasi |
compute_std() | Statistical Analysis | Hitung standar deviasi |
pearson_correlation() | Statistical Analysis | Hitung korelasi Pearson antar dua variabel |
full_summary_report() | Reporting | Laporan statistik lengkap semua kolom numerik |