FUNCTIONS & LOOPS + DATA SCIENCE

PRACTICUM WEEK 5

STUDENT IDENTITY

OCTAVIA MAIA REGO

NIM: 52250077

Data Science Student, Institut Teknologi Sains Bandung

PEMROGRAMAN ILMU DATA (DATA SCIENCE PROGRAMMING)


1. Introduction

About This Practicum

Practicum Week 5 covers three topics in R: building reusable functions, applying nested loops, and constructing simulation-driven data science pipelines. The eight tasks below build progressively toward an automated KPI dashboard and report system.

Design multi-layer functions with input validation and nested loops.
Run multi-dataset simulations for sales, company HR, and Monte Carlo methods.
Apply advanced statistical transformations and ggplot2 visualizations.
Build an automated data science workflow culminating in a full KPI Dashboard.
Generate automated HTML reports per company (Bonus Task 8).

Functions

Modular, reusable code blocks with validation and clean interfaces

Nested Loops

Outer and inner loops for multi-dimensional data generation

Simulation

Synthetic data generation for statistical experimentation

Transformation

Min-Max normalization and Z-score feature engineering

ggplot2

Professional multi-layer, faceted, and annotated visualizations

2. Libraries & Setup

Required R Packages

The following packages power all computation, visualization, and reporting in this practicum. Install them once with install.packages() if not already available.

# ============================================================
# LIBRARIES — all packages required for Practicum Week-5
# ============================================================

library(ggplot2)    # Grammar of Graphics — all plots
library(dplyr)      # Data wrangling and transformation
library(tidyr)      # Tidy data reshaping (pivot_wider, etc.)
library(scales)     # Axis label formatting (comma, percent)
library(knitr)      # Table rendering in HTML output
library(kableExtra) # Enhanced kable table styling
library(gridExtra)  # Arrange multiple ggplots in one figure
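The one-time install step mentioned above can be made explicit. The following is a minimal sketch, assuming a helper named missing_packages() (a hypothetical name, not part of the practicum code). It only reports which packages are absent and leaves the actual install.packages() call commented out.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) requireNamespace(p, quietly = TRUE),
    logical(1)
  )
  pkgs[!installed]
}

required <- c("ggplot2", "dplyr", "tidyr", "scales",
              "knitr", "kableExtra", "gridExtra")
to_install <- missing_packages(required)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Checking with requireNamespace() avoids reinstalling packages that are already present each time the report is knitted.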

3. Task 1: Dynamic Multi-Formula Function

Overview

Implement compute_formula(x, formula) supporting four formula types: linear, quadratic, cubic, and exponential. An outer iteration over the formula names calls the function once per formula, and an inner for loop evaluates each x value; input validation rejects invalid formula names. All four curves are plotted together on a log-y axis for comparison. Because the cubic is negative for x = 1 to 4, those points cannot be drawn on a log scale.

3.1 Function with Input Validation & Nested Loop

# ============================================================
# TASK 1: Dynamic Multi-Formula Function
# compute_formula(x, formula) supports 4 formula types
# Nested loop: outer = formula type, inner = x values
# Input validation: stops with informative error if invalid
# ============================================================

compute_formula <- function(x, formula) {

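  # --- Input Validation ---
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  # Require a single string so the `if` condition is always length one
  if (!is.character(formula) || length(formula) != 1 ||
      !formula %in% valid_formulas) {
    stop(paste0(
      "Invalid formula: '", paste(formula, collapse = ", "), "'. ",
      "Choose one of: ", paste(valid_formulas, collapse = ", ")
    ))
  }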

  # Pre-allocate output vector
  result <- numeric(length(x))

  # FOR LOOP: compute formula value for each x
  for (i in seq_along(x)) {
    xi <- x[i]
    result[i] <- switch(formula,
      "linear"      = 2 * xi + 3,           # f(x) = 2x + 3
      "quadratic"   = xi^2 + 2 * xi + 1,    # f(x) = x² + 2x + 1
      "cubic"       = xi^3 - 5 * xi^2 + xi, # f(x) = x³ - 5x² + x
      "exponential" = exp(0.5 * xi)          # f(x) = e^(0.5x)
    )
  }
  return(result)
}

# --- Build tidy data frame for all 4 formulas over x = 1:20 ---
x_seq        <- 1:20
formula_list <- c("linear", "quadratic", "cubic", "exponential")

# OUTER LOOP: formula type | INNER LOOP (implicit via compute_formula): x values
df_formulas <- do.call(rbind, lapply(formula_list, function(f) {
  data.frame(x = x_seq, y = compute_formula(x_seq, f), formula = f)
}))

# Preview sample values
df_formulas |>
  dplyr::filter(x %in% c(1, 5, 10, 20)) |>
  tidyr::pivot_wider(names_from = formula, values_from = y)

3.2 Visualization — Four Formulas on One Graph

# ============================================================
# TASK 1: Visualization — 4 formulas on a log-y scale
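# Gold/teal/coral/purple palette; points + lines; log scale
# Note: cubic values at x = 1..4 are negative, so they are dropped
# (with a warning) when the log10 transform is applied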
# ============================================================

formula_palette <- c(
  "linear"      = "#F4A72A",
  "quadratic"   = "#00BFA5",
  "cubic"       = "#E85D5D",
  "exponential" = "#7C6FCD"
)

ggplot(df_formulas, aes(x = x, y = y, color = formula)) +
  geom_line(linewidth = 1.6, alpha = 0.90) +
  geom_point(size = 3.0, alpha = 0.85) +
  scale_color_manual(values = formula_palette) +
  scale_y_log10(labels = comma) +
  labs(
    title    = "Task 1 — Dynamic Multi-Formula Comparison (x = 1 to 20)",
    subtitle = "Linear · Quadratic · Cubic · Exponential plotted on a log-y scale",
    x        = "x value",
    y        = "f(x) — logarithmic scale",
    color    = "Formula Type",
    caption  = "Source: Practicum Week-5 — compute_formula()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 15, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "top",
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(fill = "#FAFBFF", color = NA)
  )

Interpretation: The log-y axis lets all four formulas share one canvas without the largest curve flattening the rest. Linear (gold) grows steadily at f(x) = 2x + 3. Quadratic (teal), which equals (x + 1)², accelerates moderately as x increases. Cubic (coral) is negative for x = 1 to 4 (as low as -15 at x = 3) because of the -5x² term, so those points are undefined on a log axis and are missing from the plot; it turns positive at x = 5, overtakes the quadratic at x = 7, and is the largest of the four curves from x = 7 through x = 15. Exponential (purple) starts slowly, moves above the quadratic at x = 10 and above the cubic at x = 16, and is the largest curve from x = 16 to x = 20. Input validation inside the function ensures that any unsupported formula name triggers a clear error, a useful safeguard in data pipelines.
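The sign of the cubic and the points where the curves cross can be checked directly. The sketch below re-creates three of the Task 1 formulas in vectorized form with the same coefficients; the helper overtakes_from() is a hypothetical name introduced only for this check.

```r
# Vectorized copies of the Task 1 formulas (same coefficients as above)
x     <- 1:20
quad  <- x^2 + 2 * x + 1
cubic <- x^3 - 5 * x^2 + x
expo  <- exp(0.5 * x)

# x values where the cubic is not positive (undefined under log10)
neg_x <- x[cubic <= 0]

# Hypothetical helper: first x from which `a` stays above `b` on this grid.
# Assumes `a` is at or below `b` at least once and above it at the last x.
overtakes_from <- function(a, b) {
  above <- a > b
  x[max(which(!above)) + 1]
}

cubic_over_quad <- overtakes_from(cubic, quad)
expo_over_quad  <- overtakes_from(expo, quad)
expo_over_cubic <- overtakes_from(expo, cubic)

print(neg_x)
print(c(cubic_over_quad, expo_over_quad, expo_over_cubic))
```

Looking for the last x at which one curve is still at or below the other, rather than the first x at which it is above, matters here: the exponential is also above the cubic at small x, simply because the cubic is negative or small there.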

4. Task 2: Nested Simulation (Sales & Discounts)

Overview

Build simulate_sales(n_salesperson, days) with a nested helper function apply_discount() for conditional discount logic. The outer loop iterates over salespersons; the inner loop covers each day. Cumulative net sales are tracked and visualized.

4.1 Sales Simulation with Nested Functions & Loops

# ============================================================
# TASK 2: Nested Simulation — Sales & Discounts
# Outer loop: salesperson | Inner loop: day
# Nested helper function: apply_discount(amount)
# ============================================================

set.seed(2025)  # reproducibility

simulate_sales <- function(n_salesperson, days) {

  # --- NESTED HELPER FUNCTION: conditional discount tiers ---
  apply_discount <- function(amount) {
    if      (amount > 850) return(0.15)   # 15%: high-value sale
    else if (amount > 550) return(0.10)   # 10%: mid-value sale
    else                   return(0.05)   # 5%:  low-value sale
  }

  records <- list()
  idx     <- 1

  # OUTER LOOP: each salesperson
  for (sp in 1:n_salesperson) {
    cumulative <- 0

    # INNER LOOP: each day for this salesperson
    for (d in 1:days) {
      amount     <- round(runif(1, 180, 1000), 2)
      disc_rate  <- apply_discount(amount)
      net_amount <- round(amount * (1 - disc_rate), 2)
      cumulative <- cumulative + net_amount

      records[[idx]] <- data.frame(
        sales_id       = sp,
        day            = d,
        sales_amount   = amount,
        discount_rate  = disc_rate,
        net_amount     = net_amount,
        cumulative_net = round(cumulative, 2)
      )
      idx <- idx + 1
    }
  }
  return(dplyr::bind_rows(records))
}

# Run: 5 salespersons × 10 days
sales_df <- simulate_sales(n_salesperson = 5, days = 10)
head(sales_df, 10)

4.2 Summary Statistics per Salesperson

# ============================================================
# TASK 2: Aggregate summary per salesperson
# ============================================================

sales_summary <- sales_df |>
  dplyr::group_by(sales_id) |>
  dplyr::summarise(
    Gross_Sales   = round(sum(sales_amount), 0),
    Net_Revenue   = round(sum(net_amount), 0),
    Avg_Discount  = paste0(round(mean(discount_rate) * 100, 1), "%"),
    Peak_Cumul    = round(max(cumulative_net), 0),
    .groups = "drop"
  )

kable(sales_summary,
      caption = "Task 2 — Salesperson Summary (5 SP × 10 Days)",
      col.names = c("SP ID","Gross Sales","Net Revenue","Avg Discount","Peak Cumul.")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Task 2 — Salesperson Summary (5 SP × 10 Days)
SP ID	Gross Sales	Net Revenue	Avg Discount	Peak Cumul.
1	6512	5829	10%	5829
2	5890	5307	8.5%	5307
3	6117	5508	8.5%	5508
4	5520	4956	8.5%	4956
5	7214	6436	10%	6436

4.3 Cumulative Net Sales — Line Chart

# ============================================================
# TASK 2: Cumulative net sales line chart per salesperson
# ============================================================

sales_df$sales_id <- factor(sales_df$sales_id, labels = paste0("SP-", 1:5))

ggplot(sales_df, aes(x = day, y = cumulative_net, color = sales_id)) +
  geom_line(linewidth = 1.4) +
  geom_point(size = 3.2, shape = 21, aes(fill = sales_id),
             color = "white", stroke = 1.6) +
  scale_color_manual(values = c("#F4A72A","#E85D5D","#00BFA5","#7C6FCD","#1E88E5")) +
  scale_fill_manual(values  = c("#F4A72A","#E85D5D","#00BFA5","#7C6FCD","#1E88E5")) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title    = "Task 2 — Cumulative Net Sales per Salesperson (10 Days)",
    subtitle = "Discount tiers: >850 → 15% | >550 → 10% | ≤550 → 5%",
    x        = "Day", y = "Cumulative Net Sales",
    color    = "Salesperson", fill = "Salesperson",
    caption  = "Source: Practicum Week-5 — simulate_sales()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "right",
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(fill = "#FAFBFF", color = NA)
  )

Interpretation: All five salespersons show steadily rising cumulative net sales over the 10-day window, as expected since every daily net amount is positive. The divergence between lines reflects random variation in daily sales amounts and the discount tier each amount falls into. Because the discount applies to the whole amount, the tiers create cliffs just above each threshold: a gross sale of 851 nets 723.35, less than the 765.00 netted by a sale of 850, and a gross amount must exceed 900 before the 15% tier nets more than a sale of 850. The nested apply_discount() function separates this business logic from the simulation loop, which keeps the loop readable and the tiers easy to change.
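The effect of the tier boundaries on net revenue can be worked out with a few lines of arithmetic. This is a minimal sketch that re-implements the same tier rule as apply_discount() in vectorized form; discount_rate(), net_amount(), and break_even() are illustrative names, not part of the practicum code.

```r
# Same tier rule as apply_discount() in simulate_sales(), vectorized
discount_rate <- function(amount) {
  ifelse(amount > 850, 0.15, ifelse(amount > 550, 0.10, 0.05))
}
net_amount <- function(amount) amount * (1 - discount_rate(amount))

# Net revenue just below and just above each threshold
net_amount(c(550, 551, 850, 851))

# Gross amount in the higher tier that nets the same as a sale made
# exactly at the threshold (still in the lower tier):
#   threshold * (1 - low_rate) = amount * (1 - high_rate)
break_even <- function(threshold, low_rate, high_rate) {
  threshold * (1 - low_rate) / (1 - high_rate)
}

break_even(550, 0.05, 0.10)  # about 580.56
break_even(850, 0.10, 0.15)  # 900
```

Gross amounts between a threshold and its break-even value net less than a sale made exactly at the threshold, which is the cliff visible in the tier design.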

5. Task 3: Multi-Level Performance Categorization

Overview

Build categorize_performance(sales_amount) that classifies values into five tiers: Excellent, Very Good, Good, Average, Poor. A for loop processes the full vector element-by-element. Output includes a percentage table, bar chart, and pie chart.

5.1 Categorization Function

# ============================================================
# TASK 3: Multi-Level Performance Categorization
# categorize_performance(x) → character vector (5 tiers)
# For loop + nested if-else chain processes each element
# ============================================================

categorize_performance <- function(sales_amount) {
  category <- character(length(sales_amount))  # pre-allocate

  for (i in seq_along(sales_amount)) {
    val <- sales_amount[i]

    if      (val >= 900) category[i] <- "Excellent"   # ≥ 900
    else if (val >= 700) category[i] <- "Very Good"   # 700–899
    else if (val >= 500) category[i] <- "Good"        # 500–699
    else if (val >= 300) category[i] <- "Average"     # 300–499
    else                 category[i] <- "Poor"         # < 300
  }
  return(category)
}

# Apply to 120 simulated uniform sales values
set.seed(2025)
raw_sales  <- round(runif(120, 100, 1000), 2)
categories <- categorize_performance(raw_sales)

perf_df <- data.frame(
  sales_amount = raw_sales,
  category     = factor(categories,
                          levels = c("Poor","Average","Good","Very Good","Excellent"))
)

perf_pct <- perf_df |>
  dplyr::count(category) |>
  dplyr::mutate(
    pct   = round(n / sum(n) * 100, 1),
    label = paste0(n, " (", pct, "%)")
  )

kable(perf_pct, caption = "Task 3 — Category Distribution (n=120 sales records)",
      col.names = c("Category","Count","Percentage (%)","Label")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Task 3 — Category Distribution (n=120 sales records)
Category	Count	Percentage (%)	Label
Poor	28	23.3	28 (23.3%)
Average	21	17.5	21 (17.5%)
Good	26	21.7	26 (21.7%)
Very Good	30	25.0	30 (25%)
Excellent	15	12.5	15 (12.5%)

5.2 Bar Chart — Category Frequency

# ============================================================
# TASK 3: Bar chart of category frequency, using the tier_colors palette
# ============================================================

tier_colors <- c(
  "Poor"      = "#E85D5D",
  "Average"   = "#F4A72A",
  "Good"      = "#00BFA5",
  "Very Good" = "#1E88E5",
  "Excellent" = "#7C6FCD"
)

ggplot(perf_pct, aes(x = category, y = n, fill = category)) +
  geom_col(width = 0.60, show.legend = FALSE, alpha = 0.88) +
  geom_text(aes(label = label), vjust = -0.45,
            fontface = "bold", size = 4.0, color = "#1A1A2E") +
  scale_fill_manual(values = tier_colors) +
  labs(
    title    = "Task 3 — Performance Category Distribution (Bar Chart)",
    subtitle = "120 simulated sales values drawn from uniform distribution (100–1000)",
    x        = "Performance Category",
    y        = "Number of Records",
    caption  = "Source: Practicum Week-5 — categorize_performance()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title         = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank()
  )

5.3 Pie Chart — Proportion View

# ============================================================
# TASK 3: Pie chart — proportional breakdown by category
# ============================================================

ggplot(perf_pct, aes(x = "", y = n, fill = category)) +
  geom_col(width = 1, color = "white", linewidth = 0.9) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(category, "\n", pct, "%")),
            position = position_stack(vjust = 0.5),
            fontface = "bold", size = 3.8, color = "#1A1A2E") +
  scale_fill_manual(values = tier_colors) +
  labs(
    title   = "Task 3 — Performance Category Distribution (Pie Chart)",
    fill    = "Category",
    caption = "Source: Practicum Week-5 — categorize_performance()"
  ) +
  theme_void() +
  theme(
    plot.title      = element_text(face = "bold", size = 14, hjust = 0.5, color = "#3D2B6B"),
    legend.position = "right"
  )

Interpretation: The 120 values are drawn from a uniform distribution on 100–1000 (a range of 900), but the five tiers are not equally wide. Poor, Average, Good, and Very Good each span 200 units and should each capture about 22.2% of records, while Excellent (900–1000) spans only 100 units and should capture about 11.1%. The observed shares (23.3%, 17.5%, 21.7%, 25.0%, 12.5%) are consistent with these expectations given sampling noise at n = 120, which supports the correctness of the loop-based categorize_performance() logic. Real sales data would typically be skewed rather than uniform, so the same function would then reveal where performance is concentrated.
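The expected share of each tier follows from the interval widths. The sketch below computes those shares and compares them with the counts reported in the Task 3 table; the chi-squared goodness-of-fit test is an added check, not part of the original analysis.

```r
# Tier boundaries used by categorize_performance(), clipped to the
# sampling range of runif(120, 100, 1000)
breaks <- c(100, 300, 500, 700, 900, 1000)
tiers  <- c("Poor", "Average", "Good", "Very Good", "Excellent")

# For a uniform distribution, probability is proportional to interval width
expected_share <- setNames(diff(breaks) / (max(breaks) - min(breaks)), tiers)
round(100 * expected_share, 1)

# Expected counts out of 120 records
round(120 * expected_share, 1)

# Observed counts copied from the Task 3 table
observed <- c(Poor = 28, Average = 21, Good = 26,
              "Very Good" = 30, Excellent = 15)

# Goodness-of-fit test; a large p-value means the observed counts are
# compatible with the expected shares
fit <- chisq.test(observed, p = expected_share)
fit$p.value
```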

6. Task 4: Multi-Company Dataset Simulation

Overview

Build generate_company_data(n_company, n_employees) using nested loops (outer: company, inner: employee). Conditional logic flags top performers where KPI > 90. Output includes a per-company summary table and a salary–KPI scatter plot with regression lines.

6.1 Nested Loop Data Generation

# ============================================================
# TASK 4: Multi-Company Dataset — Nested Loops
# Outer loop: company_id | Inner loop: employee within company
# Conditional logic: top_performer flag (KPI > 90)
# ============================================================

set.seed(42)

generate_company_data <- function(n_company, n_employees) {
  depts   <- c("Engineering","Marketing","Finance","Operations","HR")
  records <- list()
  idx     <- 1

  # OUTER LOOP: each company
  for (co in 1:n_company) {
    company_id <- paste0("CO-", LETTERS[co])  # CO-A, CO-B, ...

    # INNER LOOP: each employee in this company
    for (emp in 1:n_employees) {
      salary            <- round(runif(1, 4500, 16000), 0)
      performance_score <- round(runif(1, 55, 100), 1)
      KPI_score         <- round(runif(1, 50, 100), 1)
      department        <- sample(depts, 1)

      # Conditional flag: top performer if KPI > 90
      top_performer <- ifelse(KPI_score > 90, "Yes", "No")

      records[[idx]] <- data.frame(
        company_id        = company_id,
        employee_id       = paste0(company_id, "-EMP", sprintf("%03d", emp)),
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer
      )
      idx <- idx + 1
    }
  }
  return(dplyr::bind_rows(records))
}

# Generate: 3 companies × 40 employees each
company_df <- generate_company_data(n_company = 3, n_employees = 40)
cat(sprintf("Dataset: %d employees across %d companies\n",
            nrow(company_df), length(unique(company_df$company_id))))

Dataset: 120 employees across 3 companies

head(company_df, 8)

6.2 Summary per Company

# ============================================================
# TASK 4: Company-level aggregate summary
# ============================================================

company_summary <- company_df |>
  dplyr::group_by(company_id) |>
  dplyr::summarise(
    Employees      = n(),
    Avg_Salary     = formatC(round(mean(salary), 0), big.mark = ",", format = "d"),
    Avg_Perf       = round(mean(performance_score), 2),
    Avg_KPI        = round(mean(KPI_score), 2),
    Max_KPI        = round(max(KPI_score), 1),
    Top_Performers = sum(top_performer == "Yes"),
    .groups = "drop"
  )

kable(company_summary,
      caption = "Task 4 — Multi-Company Summary (3 Companies × 40 Employees)",
      col.names = c("Company","Employees","Avg Salary","Avg Perf","Avg KPI","Max KPI","Top Performers")) |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)

Task 4 — Multi-Company Summary (3 Companies × 40 Employees)
Company	Employees	Avg Salary	Avg Perf	Avg KPI	Max KPI	Top Performers
CO-A	40	10,715	80.54	77.20	98.9	6
CO-B	40	9,964	77.03	70.45	99.0	4
CO-C	40	10,067	76.68	72.12	99.0	7

6.3 Scatter Plot — Salary vs KPI by Company

# ============================================================
# TASK 4: Scatter + regression line, colored by company
# Top performers marked with a distinct shape (triangle)
# ============================================================

ggplot(company_df, aes(x = KPI_score, y = salary,
                        color = company_id,
                        shape = top_performer)) +
  geom_point(size = 3.0, alpha = 0.72) +
  geom_smooth(aes(group = company_id), method = "lm",
              se = FALSE, linewidth = 1.1, alpha = 0.85) +
  scale_color_manual(values = c("CO-A" = "#F4A72A",
                                 "CO-B" = "#E85D5D",
                                 "CO-C" = "#00BFA5")) +
  scale_shape_manual(values = c("No" = 16, "Yes" = 17)) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Task 4 — Salary vs KPI Score by Company",
    subtitle = "Triangles = Top Performers (KPI > 90) | Lines = OLS regression per company",
    x        = "KPI Score", y = "Monthly Salary",
    color    = "Company", shape = "Top Performer",
    caption  = "Source: Practicum Week-5 — generate_company_data()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "right",
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(fill = "#FAFBFF", color = NA)
  )

Interpretation: The scatter plot shows near-zero linear correlation between salary and KPI score in all three companies; the regression lines are nearly flat. This is expected, because the simulation draws salary and KPI_score from independent uniform distributions. Top performers (triangles) appear at all salary levels, so high KPI is not associated with higher pay in this dataset. In real HR data, the same pattern would suggest reviewing how compensation is tied to performance.
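The flat regression lines can be quantified with a correlation coefficient per company. This is a minimal sketch on freshly simulated data that mimics the independent uniform draws in generate_company_data(); the data frame demo and the seed are assumptions made for illustration, not the practicum's company_df.

```r
set.seed(1)  # arbitrary seed for this illustration only
n_per_company <- 40
demo <- data.frame(
  company_id = rep(c("CO-A", "CO-B", "CO-C"), each = n_per_company),
  salary     = round(runif(3 * n_per_company, 4500, 16000), 0),
  KPI_score  = round(runif(3 * n_per_company, 50, 100), 1)
)

# Pearson correlation between KPI and salary within each company
r_by_company <- sapply(
  split(demo, demo$company_id),
  function(d) cor(d$KPI_score, d$salary)
)
round(r_by_company, 3)

# Overall test of the null hypothesis that the correlation is zero
p_overall <- cor.test(demo$KPI_score, demo$salary)$p.value
p_overall
```

Replacing demo with company_df from Task 4 would apply the same check to the dataset plotted above.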

7. Task 5: Monte Carlo Simulation (Pi & Probability)

Overview

Build monte_carlo_pi(n_points) that estimates π by sampling random points uniformly in the square [-1, 1] × [-1, 1] and testing whether they fall inside the unit circle. A secondary analysis estimates the probability that a point lands inside the sub-square [-0.5, 0.5]². All classification is done in a for loop.

7.1 Monte Carlo Function

# ============================================================
# TASK 5: Monte Carlo Pi Estimation
# monte_carlo_pi(n_points) → list(pi_estimate, prob, coords)
# FOR LOOP: classify each point inside/outside unit circle
# ============================================================

set.seed(123)  # fixed seed for reproducibility

monte_carlo_pi <- function(n_points) {
  x_pts <- runif(n_points, -1, 1)
  y_pts <- runif(n_points, -1, 1)

  in_circle    <- numeric(n_points)
  in_subsquare <- numeric(n_points)

  # FOR LOOP: classify each point
  for (i in seq_len(n_points)) {

    # Unit circle check: x² + y² ≤ 1
    if (x_pts[i]^2 + y_pts[i]^2 <= 1) {
      in_circle[i] <- 1
    }

    # Sub-square check: both |x| ≤ 0.5 and |y| ≤ 0.5
    # Theoretical probability: (1×1) / (2×2) = 0.25
    if (abs(x_pts[i]) <= 0.5 && abs(y_pts[i]) <= 0.5) {
      in_subsquare[i] <- 1
    }
  }

  pi_est    <- 4 * sum(in_circle) / n_points
  prob_sub  <- sum(in_subsquare) / n_points

  return(list(
    pi_estimate    = pi_est,
    prob_subsquare = prob_sub,
    x              = x_pts,
    y              = y_pts,
    in_circle      = in_circle
  ))
}

# Run with 6,000 points
mc <- monte_carlo_pi(6000)

cat("=== Monte Carlo Results (n = 6,000) ===\n")

=== Monte Carlo Results (n = 6,000) ===

cat(sprintf("  Estimated π       : %.6f\n", mc$pi_estimate))

  Estimated π       : 3.146000

cat(sprintf("  True π (built-in) : %.6f\n", pi))

  True π (built-in) : 3.141593

cat(sprintf("  Absolute Error    : %.6f\n", abs(mc$pi_estimate - pi)))

  Absolute Error    : 0.004407

cat(sprintf("  Error (%%)         : %.4f%%\n", abs(mc$pi_estimate - pi)/pi*100))

  Error (%)         : 0.1403%

cat(sprintf("\n  Sub-square prob   : %.4f\n", mc$prob_subsquare))


  Sub-square prob   : 0.2617

cat(sprintf("  Expected (1/4)    : 0.2500\n"))

  Expected (1/4)    : 0.2500

7.2 Visualization — Points Inside vs Outside Circle

# ============================================================
# TASK 5: Plot — Monte Carlo sampling scatter
# Gold = inside circle | Coral = outside circle
# Navy circle boundary | Purple dashed sub-square
# ============================================================

mc_df <- data.frame(
  x         = mc$x,
  y         = mc$y,
  in_circle = factor(mc$in_circle,
                      levels = c(0, 1),
                      labels = c("Outside Circle","Inside Circle"))
)

ggplot(mc_df, aes(x = x, y = y, color = in_circle)) +
  geom_point(size = 0.60, alpha = 0.50) +
  annotate("path",
           x = cos(seq(0, 2 * pi, length.out = 300)),
           y = sin(seq(0, 2 * pi, length.out = 300)),
           color = "#3D2B6B", linewidth = 1.4) +
  annotate("rect", xmin = -0.5, xmax = 0.5, ymin = -0.5, ymax = 0.5,
           color = "#7C6FCD", fill = NA, linewidth = 1.2, linetype = "dashed") +
  scale_color_manual(values = c("Outside Circle" = "#E85D5D",
                                 "Inside Circle"  = "#F4A72A")) +
  coord_fixed() +
  labs(
    title    = "Task 5 — Monte Carlo π Estimation (n = 6,000)",
    subtitle = paste0("Estimated π = ", round(mc$pi_estimate, 5),
                      "  |  Sub-square hit rate = ",
                      round(mc$prob_subsquare, 4)),
    x        = "x coordinate", y = "y coordinate",
    color    = "Point Region",
    caption  = "Navy = unit circle boundary | Purple dashed = sub-square [-0.5, 0.5]²"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "bottom",
    panel.grid.minor = element_blank()
  )

Interpretation: With 6,000 random points, the simulation estimates π as 3.146, an absolute error of about 0.0044 (0.14%). The geometric principle: the unit circle has area π while the enclosing 2×2 square has area 4, so the proportion of points inside the circle approaches π/4; multiplying by 4 recovers π. The sub-square (purple dashed) captured 26.17% of the points, somewhat above its theoretical area fraction of 1/4 (about two standard errors at this sample size). As n_points increases, both estimates converge toward their true values by the law of large numbers, with the typical error shrinking in proportion to 1/√n.
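How close the estimates should be at n = 6,000 can be judged from the standard error of a proportion. The sketch below treats each point as a Bernoulli trial and expresses the reported deviations in standard-error units; se_prop() is an illustrative helper name, and the values 3.146 and 0.2617 are copied from the output above.

```r
n <- 6000

# Each point is a Bernoulli trial; the standard error of a proportion
# with success probability p is sqrt(p * (1 - p) / n)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_pi  <- 4 * se_prop(pi / 4, n)   # pi estimate = 4 * proportion in circle
se_sub <- se_prop(0.25, n)         # sub-square proportion

# Reported deviations from the run above, in standard-error units
z_pi  <- abs(3.146  - pi)   / se_pi
z_sub <- abs(0.2617 - 0.25) / se_sub

round(c(se_pi = se_pi, se_sub = se_sub, z_pi = z_pi, z_sub = z_sub), 4)
```

Because the standard error scales with 1/√n, cutting the typical error in half requires roughly four times as many points.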

8. Task 6: Data Transformation & Feature Engineering

Overview

Build normalize_columns(df) (Min-Max) and z_score(df) (Z-Score Standardization) using loop-based column iteration. Then engineer two new categorical features. Compare salary distributions before and after transformation with faceted histograms and a violin+boxplot.

8.1 Transformation Functions

# ============================================================
# TASK 6: Loop-based normalization & standardization
# normalize_columns(df) → Min-Max scaling to [0, 1]
# z_score(df)           → Mean=0, SD=1 standardization
# Both iterate over column names with a for loop
# ============================================================

# --- Function 1: Min-Max Normalization ---
normalize_columns <- function(df) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  result   <- df

  for (col in num_cols) {                          # FOR LOOP over columns
    lo          <- min(df[[col]], na.rm = TRUE)
    hi          <- max(df[[col]], na.rm = TRUE)
    result[[col]] <- (df[[col]] - lo) / (hi - lo) # Min-Max formula
  }
  return(result)
}

# --- Function 2: Z-Score Standardization ---
z_score <- function(df) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  result   <- df

  for (col in num_cols) {                            # FOR LOOP over columns
    mu          <- mean(df[[col]], na.rm = TRUE)
    sigma       <- sd(df[[col]],   na.rm = TRUE)
    result[[col]] <- (df[[col]] - mu) / sigma        # Z-Score formula
  }
  return(result)
}

# Apply both functions to numeric columns from Task 4 dataset
company_num    <- company_df |> dplyr::select(salary, performance_score, KPI_score)
company_norm   <- normalize_columns(company_num)
company_zscore <- z_score(company_num)

cat("--- ORIGINAL ---\n")

--- ORIGINAL ---

summary(company_num)

     salary      performance_score   KPI_score    
 Min.   : 4527   Min.   :55.00     Min.   :50.20  
 1st Qu.: 7788   1st Qu.:63.12     1st Qu.:60.40  
 Median : 9993   Median :79.20     Median :71.75  
 Mean   :10249   Mean   :78.09     Mean   :73.26  
 3rd Qu.:13049   3rd Qu.:90.40     3rd Qu.:86.53  
 Max.   :15872   Max.   :99.60     Max.   :99.00

cat("\n--- MIN-MAX NORMALIZED [0,1] ---\n")


--- MIN-MAX NORMALIZED [0,1] ---

summary(company_norm)

     salary       performance_score   KPI_score     
 Min.   :0.0000   Min.   :0.0000    Min.   :0.0000  
 1st Qu.:0.2874   1st Qu.:0.1822    1st Qu.:0.2090  
 Median :0.4818   Median :0.5426    Median :0.4416  
 Mean   :0.5043   Mean   :0.5177    Mean   :0.4725  
 3rd Qu.:0.7512   3rd Qu.:0.7937    3rd Qu.:0.7444  
 Max.   :1.0000   Max.   :1.0000    Max.   :1.0000

cat("\n--- Z-SCORE STANDARDIZED (mean≈0, sd=1) ---\n")


--- Z-SCORE STANDARDIZED (mean≈0, sd=1) ---

summary(company_zscore)

     salary         performance_score    KPI_score      
 Min.   :-1.74920   Min.   :-1.64529   Min.   :-1.5877  
 1st Qu.:-0.75232   1st Qu.:-1.06628   1st Qu.:-0.8853  
 Median :-0.07812   Median : 0.07928   Median :-0.1038  
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.85624   3rd Qu.: 0.87743   3rd Qu.: 0.9136  
 Max.   : 1.71921   Max.   : 1.53305   Max.   : 1.7726

8.2 Feature Engineering

# ============================================================
# TASK 6: Feature Engineering — 2 new categorical columns
# performance_category: Low / Medium / High
# salary_bracket:       Entry / Mid / Senior
# ============================================================

company_df <- company_df |>
  dplyr::mutate(
    performance_category = dplyr::case_when(
      performance_score >= 87 ~ "High",
      performance_score >= 72 ~ "Medium",
      TRUE                    ~ "Low"
    ),
    salary_bracket = dplyr::case_when(
      salary >= 13000 ~ "Senior",
      salary >= 8000  ~ "Mid",
      TRUE            ~ "Entry"
    )
  )

cat("=== New Feature: performance_category ===\n")

=== New Feature: performance_category ===

print(table(company_df$performance_category))


  High    Low Medium 
    40     43     37

cat("\n=== New Feature: salary_bracket ===\n")


=== New Feature: salary_bracket ===

print(table(company_df$salary_bracket))


 Entry    Mid Senior 
    33     56     31

8.3 Faceted Histograms — Before vs After Transformation

# ============================================================
# TASK 6: Faceted histogram — salary in 3 forms
# ============================================================

hist_df <- dplyr::bind_rows(
  data.frame(value = company_num$salary,    type = "1. Original"),
  data.frame(value = company_norm$salary,   type = "2. Min-Max [0,1]"),
  data.frame(value = company_zscore$salary, type = "3. Z-Score")
)

ggplot(hist_df, aes(x = value, fill = type)) +
  geom_histogram(bins = 18, color = "white", alpha = 0.88) +
  facet_wrap(~type, scales = "free_x", nrow = 1) +
  scale_fill_manual(values = c(
    "1. Original"    = "#F4A72A",
    "2. Min-Max [0,1]" = "#00BFA5",
    "3. Z-Score"     = "#7C6FCD"
  )) +
  labs(
    title    = "Task 6 — Salary Distribution: Original vs Transformed",
    subtitle = "Shape preserved across all three — only the axis scale changes",
    x = "Value", y = "Count",
    caption  = "Source: Practicum Week-5 — normalize_columns() & z_score()"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title       = element_text(face = "bold", size = 13, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "none",
    strip.text       = element_text(face = "bold", size = 10),
    panel.grid.minor = element_blank()
  )

8.4 Violin + Boxplot — Salary by Performance Category

# ============================================================
# TASK 6: Violin + Boxplot — salary by performance category
# ============================================================

company_df$performance_category <- factor(
  company_df$performance_category, levels = c("Low","Medium","High")
)

ggplot(company_df, aes(x = performance_category, y = salary,
                        fill = performance_category)) +
  geom_violin(alpha = 0.50, trim = FALSE) +
  geom_boxplot(width = 0.18, alpha = 0.88, outlier.size = 3,
               outlier.color = "#E85D5D", color = "#1A1A2E") +
  geom_jitter(width = 0.06, alpha = 0.40, size = 2, color = "#455A64") +
  scale_fill_manual(values = c("Low" = "#E85D5D",
                                "Medium" = "#F4A72A",
                                "High"   = "#00BFA5")) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Task 6 — Salary Distribution by Performance Category",
    subtitle = "Violin = distribution shape | Box = median & IQR | Dots = individual employees",
    x = "Performance Category", y = "Monthly Salary", fill = "Category",
    caption  = "Source: Practicum Week-5 — Feature Engineering"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "none",
    panel.grid.minor = element_blank()
  )

Interpretation: The three faceted histograms show that both transformations preserve the shape of the distribution; only the scale changes, because each one is a linear rescaling of the original values. Min-Max normalization maps every value into [0, 1], which makes columns with different units directly comparable. Z-score standardization shifts the mean to 0 and rescales to unit standard deviation, which matters for distance-based ML algorithms such as k-NN and SVM. The violin and box plot shows broadly similar salary ranges across all three performance categories. In a real HR setting that would signal a disconnect between pay and performance, a relevant finding for compensation strategy.
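The shape-preservation claim follows from both transforms being positive linear maps of the input. A minimal sketch, using a made-up salary vector rather than company_df; the helper names min_max() and z_std() are hypothetical and are not the normalize_columns() and z_score() functions from Section 8.1:

```r
# Made-up salary sample; values are NOT taken from company_df
set.seed(1)
salary <- round(rnorm(50, mean = 10000, sd = 2500))

# Illustrative helpers (names are assumptions for this sketch)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
z_std   <- function(x) (x - mean(x)) / sd(x)

mm <- min_max(salary)
zs <- z_std(salary)

range(mm)                          # smallest value maps to 0, largest to 1
c(mean = mean(zs), sd = sd(zs))    # mean is 0 up to floating-point error, sd is 1

# A positive linear map leaves the correlation with the original at 1
cor(salary, mm)
cor(salary, zs)
```

Because the correlation with the original values stays at 1, a histogram of either transformed column has the same shape as the original once the x-axis is allowed to rescale, which is what `scales = "free_x"` does in the faceted plot.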

9. Task 7 — Mini Project: Company KPI Dashboard

Overview

Generate a full dataset for 5 companies × 60 employees each (300 rows). Use a for loop to classify each employee into one of 4 KPI tiers (At Risk, Growing, Solid, Elite). Produce a company summary table, a grouped bar chart, a scatter plot faceted by company, and a salary density (area) chart.

9.1 Generate Dashboard Dataset

# ============================================================
# TASK 7: Full KPI Dashboard — 5 companies × 60 employees
# Reuses generate_company_data() from Task 4
# FOR LOOP: classifies each employee into a KPI tier
# ============================================================

set.seed(77)
dashboard_df <- generate_company_data(n_company = 5, n_employees = 60)

# --- FOR LOOP: KPI tier classification ---
kpi_tier <- character(nrow(dashboard_df))

for (i in seq_len(nrow(dashboard_df))) {
  kpi <- dashboard_df$KPI_score[i]

  if      (kpi >= 90) kpi_tier[i] <- "Elite"      # Top achievers
  else if (kpi >= 75) kpi_tier[i] <- "Solid"       # Consistent performers
  else if (kpi >= 60) kpi_tier[i] <- "Growing"     # Developing
  else                kpi_tier[i] <- "At Risk"      # Needs support
}

dashboard_df$kpi_tier <- factor(kpi_tier,
                                 levels = c("At Risk","Growing","Solid","Elite"))

cat(sprintf("Dashboard dataset: %d employees, %d companies\n",
            nrow(dashboard_df), length(unique(dashboard_df$company_id))))

Dashboard dataset: 300 employees, 5 companies

head(dashboard_df, 6)
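The tier loop in Section 9.1 can also be written without an explicit loop. A minimal sketch, assuming the same four thresholds and using a small made-up score vector instead of dashboard_df. With `right = FALSE`, `cut()` builds intervals that are closed on the left, which matches the `>=` comparisons in the loop:

```r
# Made-up KPI scores, including the boundary values 60, 75 and 90
kpi_score   <- c(42.5, 60, 74.9, 75, 89.9, 90, 100)
tier_levels <- c("At Risk", "Growing", "Solid", "Elite")

# Intervals: [-Inf, 60)  [60, 75)  [75, 90)  [90, Inf)
kpi_tier_vec <- cut(kpi_score,
                    breaks = c(-Inf, 60, 75, 90, Inf),
                    labels = tier_levels,
                    right  = FALSE)

# Reference implementation: the same if/else chain as the loop version
kpi_tier_loop <- character(length(kpi_score))
for (i in seq_along(kpi_score)) {
  k <- kpi_score[i]
  if (k >= 90) {
    kpi_tier_loop[i] <- "Elite"
  } else if (k >= 75) {
    kpi_tier_loop[i] <- "Solid"
  } else if (k >= 60) {
    kpi_tier_loop[i] <- "Growing"
  } else {
    kpi_tier_loop[i] <- "At Risk"
  }
}

# Both approaches should agree on every score, including the boundaries
data.frame(kpi_score, kpi_tier_vec, kpi_tier_loop)
```

`cut()` already returns a factor with the levels in the order given by `labels`, so the separate `factor()` call is not needed in this variant.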

9.2 Company Summary Table

# ============================================================
# TASK 7: Aggregate KPI summary per company
# ============================================================

dashboard_summary <- dashboard_df |>
  dplyr::group_by(company_id) |>
  dplyr::summarise(
    Avg_Salary    = formatC(round(mean(salary), 0), big.mark = ",", format = "d"),
    Avg_KPI       = round(mean(KPI_score), 2),
    Top_Performers= sum(top_performer == "Yes"),
    Elite         = sum(kpi_tier == "Elite"),
    Growing       = sum(kpi_tier == "Growing"),
    At_Risk       = sum(kpi_tier == "At Risk"),
    .groups = "drop"
  )

kable(dashboard_summary,
      caption = "Task 7 — KPI Dashboard: Company Summary (5 Companies × 60 Employees)") |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)

Task 7 — KPI Dashboard: Company Summary (5 Companies × 60 Employees)
company_id	Avg_Salary	Avg_KPI	Top_Performers	Elite	Growing	At_Risk
CO-A	10,281	73.80	13	13	19	15
CO-B	10,114	71.11	9	9	18	21
CO-C	10,538	75.46	14	14	18	13
CO-D	10,373	78.98	10	10	15	6
CO-E	10,629	77.84	16	16	12	11

9.3 Grouped Bar Chart — KPI Tier per Company

# ============================================================
# TASK 7: Grouped bar chart — KPI tier count per company
# ============================================================

kpi_palette <- c(
  "At Risk" = "#E85D5D",
  "Growing" = "#F4A72A",
  "Solid"   = "#1E88E5",
  "Elite"   = "#7C6FCD"
)

tier_counts <- dashboard_df |>
  dplyr::count(company_id, kpi_tier)

ggplot(tier_counts, aes(x = company_id, y = n, fill = kpi_tier)) +
  geom_col(position = "dodge", width = 0.75, alpha = 0.90) +
  geom_text(aes(label = n), position = position_dodge(width = 0.75),
            vjust = -0.4, fontface = "bold", size = 3.5) +
  scale_fill_manual(values = kpi_palette) +
  labs(
    title    = "Task 7 — KPI Tier Distribution per Company (Grouped Bar)",
    subtitle = "5 companies × 60 employees; 4 KPI performance tiers",
    x        = "Company", y = "Employee Count", fill = "KPI Tier",
    caption  = "Source: Practicum Week-5 — KPI Dashboard"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "top",
    panel.grid.minor = element_blank()
  )

9.4 Faceted Scatter — KPI vs Salary by Company

# ============================================================
# TASK 7: Faceted scatter + regression lines per company
# Color = KPI tier; shape = top_performer status
# ============================================================

ggplot(dashboard_df, aes(x = KPI_score, y = salary,
                           color = kpi_tier, shape = top_performer)) +
  geom_point(size = 2.5, alpha = 0.72) +
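  # group = 1: one fitted line per facet (the inherited shape mapping
  # would otherwise split the fit into separate Yes/No groups)
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, linewidth = 0.8,
              color = "#3D2B6B", alpha = 0.70) +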
  scale_color_manual(values = kpi_palette) +
  scale_shape_manual(values = c("No" = 16, "Yes" = 17)) +
  scale_y_continuous(labels = comma) +
  facet_wrap(~company_id, nrow = 2) +
  labs(
    title    = "Task 7 — KPI Score vs Salary by Company (Faceted)",
    subtitle = "Triangles = Top Performers (KPI > 90) | Regression lines per facet",
    x        = "KPI Score", y = "Monthly Salary",
    color    = "KPI Tier", shape = "Top Performer",
    caption  = "Source: Practicum Week-5 — KPI Dashboard (n=300)"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title       = element_text(face = "bold", size = 13, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 9),
    legend.position  = "right",
    strip.text       = element_text(face = "bold", color = "#0F3460"),
    panel.grid.minor = element_blank()
  )

9.5 Salary Density by KPI Tier (Area Chart)

# ============================================================
# TASK 7: Overlapping area density chart — salary by KPI tier
# ============================================================

ggplot(dashboard_df, aes(x = salary, fill = kpi_tier, color = kpi_tier)) +
  geom_density(alpha = 0.30, linewidth = 0.9) +
  scale_fill_manual(values  = kpi_palette) +
  scale_color_manual(values = kpi_palette) +
  scale_x_continuous(labels = comma) +
  labs(
    title    = "Task 7 — Salary Density by KPI Tier (Area Chart)",
    subtitle = "Overlapping densities reveal salary spread within each KPI tier",
    x        = "Monthly Salary", y = "Density",
    fill     = "KPI Tier", color = "KPI Tier",
    caption  = "Source: Practicum Week-5 — KPI Dashboard (n=300)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#3D2B6B"),
    plot.subtitle    = element_text(color = "#455A64", size = 10),
    legend.position  = "right",
    panel.grid.minor = element_blank()
  )

Interpretation: The KPI Dashboard combines the earlier tasks into one end-to-end pipeline: data generation, loop-based classification, and multi-view visualization. The summary table and the grouped bar chart show that the tier mix differs between companies even though all five come from the same generation function: CO-D has 29 Solid and only 6 At Risk employees, while CO-B has 21 At Risk and 12 Solid, and average KPI ranges from 71.11 (CO-B) to 78.98 (CO-D). The faceted scatter plots suggest little or no linear relationship between KPI score and salary within each company, in line with the Task 4 result at a larger scale. The top-3 lists in Section 10.2 point the same way: employees with KPI scores above 95 earn anywhere from IDR 6,816 to IDR 15,818. The density chart shows salary distributions that overlap substantially across all four KPI tiers, so in this simulated workforce the performance tier says little about pay.
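One way to put a number on the KPI-salary relationship is a per-company Pearson correlation. A minimal sketch on synthetic data in which salary and KPI_score are drawn independently; the column names follow dashboard_df, but the values, the seed, and the distribution parameters are assumptions made for illustration:

```r
# Synthetic stand-in for dashboard_df (values are NOT the practicum data)
set.seed(123)
toy_df <- data.frame(
  company_id = rep(paste0("CO-", LETTERS[1:5]), each = 60),
  KPI_score  = runif(300, min = 40, max = 100),
  salary     = rnorm(300, mean = 10000, sd = 2500)
)

# Split the rows by company, then compute Pearson's r inside each group
cor_by_company <- sapply(
  split(toy_df, toy_df$company_id),
  function(d) cor(d$KPI_score, d$salary)
)

round(cor_by_company, 3)
```

Applied to dashboard_df, values close to 0 in every company would support the reading of the scatter plots, while a clearly positive r would contradict it.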

10. Task 8 — Automated Report Generation (Bonus)

Overview

Implement an automated report generation pipeline using functions and loops. A single generate_report() function encapsulates the per-company logic, and a for loop calls it once for every company to print a structured text summary. Two further loops build a CSV export and a grid of auto-generated mini-plots, one per company.

10.1 Automated Report Function

# ============================================================
# TASK 8 (BONUS): Automated Report Generation
# generate_report(cid, df) → prints full company summary
# Called inside a for loop — no manual repetition
# ============================================================

generate_report <- function(cid, df) {
  cdata       <- df |> dplyr::filter(company_id == cid)
  n_emp       <- nrow(cdata)
  avg_salary  <- round(mean(cdata$salary), 0)
  avg_kpi     <- round(mean(cdata$KPI_score), 2)
  avg_perf    <- round(mean(cdata$performance_score), 2)
  n_top       <- sum(cdata$top_performer == "Yes")
  pct_top     <- round(n_top / n_emp * 100, 1)
  tier_tbl    <- table(cdata$kpi_tier)

  dept_summary <- cdata |>
    dplyr::group_by(department) |>
    dplyr::summarise(Count = n(), Avg_KPI = round(mean(KPI_score), 1), .groups = "drop") |>
    dplyr::arrange(dplyr::desc(Avg_KPI))

  top3 <- cdata |>
    dplyr::arrange(dplyr::desc(KPI_score)) |>
    dplyr::select(employee_id, department, KPI_score, salary) |>
    head(3)

  cat(strrep("=", 62), "\n", sep = "")
  cat(sprintf("  AUTOMATED REPORT    COMPANY %s\n", cid))
  cat(strrep("=", 62), "\n\n", sep = "")
  cat(sprintf("  Total Employees    : %d\n", n_emp))
  cat(sprintf("  Avg Monthly Salary : IDR %s\n",
              formatC(avg_salary, big.mark = ",", format = "d")))
  cat(sprintf("  Avg KPI Score      : %.2f\n", avg_kpi))
  cat(sprintf("  Avg Perf Score     : %.2f\n", avg_perf))
  cat(sprintf("  Top Performers     : %d  (%.1f%% of workforce)\n\n", n_top, pct_top))

  cat("   KPI TIER BREAKDOWN \n")
  for (tier in names(tier_tbl)) {
    # Text bar: 28 characters would correspond to 100% of the workforce
    bar <- strrep("#", round(tier_tbl[[tier]] / n_emp * 28))
    cat(sprintf("  %-10s : %2d employees  %s\n", tier, tier_tbl[[tier]], bar))
  }

  cat("\n   DEPARTMENT SUMMARY \n")
  for (r in seq_len(nrow(dept_summary))) {
    cat(sprintf("  %-14s : %2d emp  |  Avg KPI = %.1f\n",
                dept_summary$department[r], dept_summary$Count[r],
                dept_summary$Avg_KPI[r]))
  }

  cat("\n   TOP 3 PERFORMERS \n")
  for (r in seq_len(nrow(top3))) {
    cat(sprintf("  #%d  %-16s | Dept: %-14s | KPI: %.1f | Salary: IDR %s\n",
                r, top3$employee_id[r], top3$department[r],
                top3$KPI_score[r],
                formatC(top3$salary[r], big.mark = ",", format = "d")))
  }
  cat("\n")
}
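generate_report() assumes that cid exists in df. A minimal sketch of a guard that could run first, in the spirit of the input validation from Task 1; the helper name check_company() is hypothetical, and a toy data frame stands in for dashboard_df:

```r
# Hypothetical helper: stop early when the input or the company ID is invalid
check_company <- function(cid, df) {
  if (!is.data.frame(df) || !"company_id" %in% names(df)) {
    stop("df must be a data frame with a company_id column")
  }
  if (!cid %in% df$company_id) {
    stop(sprintf("Unknown company_id: %s", cid))
  }
  invisible(TRUE)
}

# Toy data standing in for dashboard_df
toy_df <- data.frame(company_id = c("CO-A", "CO-A", "CO-B"),
                     KPI_score  = c(91, 72, 65))

check_company("CO-A", toy_df)   # passes silently

# An unknown ID raises an error; tryCatch() turns it into a message string
res <- tryCatch(check_company("CO-Z", toy_df),
                error = function(e) conditionMessage(e))
res
```

Without such a check, an unknown ID produces an empty subset, the means become NaN, and a loop written as `1:nrow(x)` runs over `c(1, 0)` instead of being skipped.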

10.2 Loop — Run Report for All Companies

# ============================================================
# TASK 8 (BONUS): FOR LOOP — generate report for every company
# ============================================================

all_companies <- sort(unique(dashboard_df$company_id))

for (cid in all_companies) {
  generate_report(cid, dashboard_df)
}


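==============================================================
  AUTOMATED REPORT    COMPANY CO-A
==============================================================

  Total Employees    : 60
  Avg Monthly Salary : IDR 10,281
  Avg KPI Score      : 73.80
  Avg Perf Score     : 78.05
  Top Performers     : 13  (21.7% of workforce)

   KPI TIER BREAKDOWN 
  At Risk    : 15 employees  #######
  Growing    : 19 employees  #########
  Solid      : 13 employees  ######
  Elite      : 13 employees  ######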

   DEPARTMENT SUMMARY 
  HR             : 13 emp  |  Avg KPI = 75.0
  Marketing      : 10 emp  |  Avg KPI = 74.4
  Engineering    : 13 emp  |  Avg KPI = 74.1
  Finance        : 17 emp  |  Avg KPI = 74.0
  Operations     :  7 emp  |  Avg KPI = 69.8

   TOP 3 PERFORMERS 
  #1  CO-A-EMP015      | Dept: Engineering    | KPI: 96.7 | Salary: IDR 15,272
  #2  CO-A-EMP008      | Dept: HR             | KPI: 96.0 | Salary: IDR 9,534
  #3  CO-A-EMP013      | Dept: Finance        | KPI: 95.1 | Salary: IDR 15,614


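==============================================================
  AUTOMATED REPORT    COMPANY CO-B
==============================================================

  Total Employees    : 60
  Avg Monthly Salary : IDR 10,114
  Avg KPI Score      : 71.11
  Avg Perf Score     : 77.75
  Top Performers     : 9  (15.0% of workforce)

   KPI TIER BREAKDOWN 
  At Risk    : 21 employees  ##########
  Growing    : 18 employees  ########
  Solid      : 12 employees  ######
  Elite      :  9 employees  ####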

   DEPARTMENT SUMMARY 
  Marketing      : 14 emp  |  Avg KPI = 73.1
  Finance        :  5 emp  |  Avg KPI = 72.9
  HR             : 12 emp  |  Avg KPI = 71.5
  Engineering    : 11 emp  |  Avg KPI = 71.2
  Operations     : 18 emp  |  Avg KPI = 68.8

   TOP 3 PERFORMERS 
  #1  CO-B-EMP028      | Dept: Operations     | KPI: 99.9 | Salary: IDR 13,965
  #2  CO-B-EMP055      | Dept: Operations     | KPI: 99.6 | Salary: IDR 10,350
  #3  CO-B-EMP059      | Dept: Marketing      | KPI: 96.3 | Salary: IDR 14,422


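==============================================================
  AUTOMATED REPORT    COMPANY CO-C
==============================================================

  Total Employees    : 60
  Avg Monthly Salary : IDR 10,538
  Avg KPI Score      : 75.46
  Avg Perf Score     : 79.04
  Top Performers     : 14  (23.3% of workforce)

   KPI TIER BREAKDOWN 
  At Risk    : 13 employees  ######
  Growing    : 18 employees  ########
  Solid      : 15 employees  #######
  Elite      : 14 employees  #######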

   DEPARTMENT SUMMARY 
  HR             :  9 emp  |  Avg KPI = 83.7
  Marketing      :  9 emp  |  Avg KPI = 79.5
  Engineering    : 13 emp  |  Avg KPI = 75.9
  Finance        : 13 emp  |  Avg KPI = 71.7
  Operations     : 16 emp  |  Avg KPI = 71.3

   TOP 3 PERFORMERS 
  #1  CO-C-EMP052      | Dept: HR             | KPI: 100.0 | Salary: IDR 6,816
  #2  CO-C-EMP045      | Dept: Marketing      | KPI: 99.7 | Salary: IDR 10,957
  #3  CO-C-EMP016      | Dept: Engineering    | KPI: 99.4 | Salary: IDR 8,941


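==============================================================
  AUTOMATED REPORT    COMPANY CO-D
==============================================================

  Total Employees    : 60
  Avg Monthly Salary : IDR 10,373
  Avg KPI Score      : 78.98
  Avg Perf Score     : 77.19
  Top Performers     : 10  (16.7% of workforce)

   KPI TIER BREAKDOWN 
  At Risk    :  6 employees  ###
  Growing    : 15 employees  #######
  Solid      : 29 employees  ##############
  Elite      : 10 employees  #####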

   DEPARTMENT SUMMARY 
  HR             : 12 emp  |  Avg KPI = 83.2
  Finance        : 14 emp  |  Avg KPI = 80.9
  Engineering    : 14 emp  |  Avg KPI = 78.2
  Operations     :  7 emp  |  Avg KPI = 76.6
  Marketing      : 13 emp  |  Avg KPI = 75.1

   TOP 3 PERFORMERS 
  #1  CO-D-EMP034      | Dept: Finance        | KPI: 99.8 | Salary: IDR 12,610
  #2  CO-D-EMP028      | Dept: Marketing      | KPI: 99.5 | Salary: IDR 9,633
  #3  CO-D-EMP040      | Dept: Finance        | KPI: 99.5 | Salary: IDR 8,254


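==============================================================
  AUTOMATED REPORT    COMPANY CO-E
==============================================================

  Total Employees    : 60
  Avg Monthly Salary : IDR 10,629
  Avg KPI Score      : 77.84
  Avg Perf Score     : 75.60
  Top Performers     : 16  (26.7% of workforce)

   KPI TIER BREAKDOWN 
  At Risk    : 11 employees  #####
  Growing    : 12 employees  ######
  Solid      : 21 employees  ##########
  Elite      : 16 employees  #######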

   DEPARTMENT SUMMARY 
  Finance        : 15 emp  |  Avg KPI = 80.3
  Marketing      : 12 emp  |  Avg KPI = 79.8
  Engineering    : 11 emp  |  Avg KPI = 79.0
  HR             :  9 emp  |  Avg KPI = 74.8
  Operations     : 13 emp  |  Avg KPI = 74.3

   TOP 3 PERFORMERS 
  #1  CO-E-EMP022      | Dept: Operations     | KPI: 100.0 | Salary: IDR 13,084
  #2  CO-E-EMP001      | Dept: Finance        | KPI: 99.4 | Salary: IDR 14,914
  #3  CO-E-EMP024      | Dept: Finance        | KPI: 99.3 | Salary: IDR 15,818
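Because generate_report() writes with cat(), its output can be captured and saved, one file per company. A minimal sketch, assuming a stand-in function toy_report() and a temporary directory; neither the function name nor the file naming scheme comes from the practicum:

```r
# Stand-in for generate_report(): prints a two-line summary with cat()
toy_report <- function(cid, df) {
  cdata <- df[df$company_id == cid, ]
  cat(sprintf("COMPANY %s\n", cid))
  cat(sprintf("Avg KPI Score : %.2f\n", mean(cdata$KPI_score)))
}

# Toy data standing in for dashboard_df
toy_df <- data.frame(company_id = c("CO-A", "CO-A", "CO-B"),
                     KPI_score  = c(90, 70, 65))

out_dir      <- tempdir()
report_files <- character(0)

for (cid in sort(unique(toy_df$company_id))) {
  # capture.output() collects everything the function prints as a character vector
  report_lines <- capture.output(toy_report(cid, toy_df))
  path <- file.path(out_dir, paste0("report_", cid, ".txt"))
  writeLines(report_lines, path)
  report_files[cid] <- path
}

readLines(report_files[["CO-A"]])
```

The same loop would work with generate_report(cid, dashboard_df) in place of the stand-in, since only the printed output is captured.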

10.3 CSV Export

# ============================================================
# TASK 8 (BONUS): Automated CSV export via for loop
# ============================================================

export_list <- list()

for (cid in all_companies) {
  cdata <- dashboard_df |> dplyr::filter(company_id == cid)

  export_list[[cid]] <- data.frame(
    company_id      = cid,
    n_employees     = nrow(cdata),
    avg_salary      = round(mean(cdata$salary), 0),
    avg_kpi         = round(mean(cdata$KPI_score), 2),
    avg_performance = round(mean(cdata$performance_score), 2),
    top_performers  = sum(cdata$top_performer == "Yes"),
    elite_count     = sum(cdata$kpi_tier == "Elite"),
    at_risk_count   = sum(cdata$kpi_tier == "At Risk")
  )
}

export_df <- dplyr::bind_rows(export_list)
write.csv(export_df, "octavia_kpi_report.csv", row.names = FALSE)

cat(" CSV exported: octavia_kpi_report.csv\n\n")

 CSV exported: octavia_kpi_report.csv

kable(export_df,
      caption = "Task 8 (Bonus) — Automated Export: Full Company KPI Summary") |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)

Task 8 (Bonus) — Automated Export: Full Company KPI Summary
company_id	n_employees	avg_salary	avg_kpi	avg_performance	top_performers	elite_count	at_risk_count
CO-A	60	10281	73.80	78.05	13	13	15
CO-B	60	10114	71.11	77.75	9	9	21
CO-C	60	10538	75.46	79.04	14	14	13
CO-D	60	10373	78.98	77.19	10	10	6
CO-E	60	10629	77.84	75.60	16	16	11
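A quick round-trip check can confirm that an exported file matches the in-memory table. A minimal sketch that writes to a temporary file instead of octavia_kpi_report.csv and uses only the first two rows of the summary above, typed in by hand:

```r
# First two rows of the export table, re-entered manually for this sketch
export_toy <- data.frame(
  company_id  = c("CO-A", "CO-B"),
  n_employees = c(60L, 60L),
  avg_salary  = c(10281, 10114),
  avg_kpi     = c(73.80, 71.11)
)

csv_path <- tempfile(fileext = ".csv")
write.csv(export_toy, csv_path, row.names = FALSE)

# Read the file back; stringsAsFactors = FALSE keeps the IDs as character
roundtrip <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare column names and values between the original and the re-read table
identical(names(export_toy), names(roundtrip))
all.equal(export_toy$avg_kpi, roundtrip$avg_kpi)
all.equal(export_toy$avg_salary, as.numeric(roundtrip$avg_salary))
```

Setting `row.names = FALSE`, as the export code does, matters for this check: otherwise the file gains an extra unnamed index column and the column names no longer match.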

10.4 Auto-Generated Mini-Plots per Company

# ============================================================
# TASK 8 (BONUS): FOR LOOP — auto-build one plot per company
# gridExtra assembles them into a single dashboard figure
# ============================================================

plot_list <- list()

for (cid in all_companies) {
  cdata <- dashboard_df |> dplyr::filter(company_id == cid)

  p <- ggplot(cdata, aes(x = kpi_tier, fill = kpi_tier)) +
    geom_bar(alpha = 0.88, show.legend = FALSE) +
    geom_text(stat = "count", aes(label = after_stat(count)),
              vjust = -0.4, fontface = "bold", size = 3.8) +
    scale_fill_manual(values = kpi_palette) +
    scale_y_continuous(limits = c(0, 30)) +
    labs(title = paste0("Company ", cid), x = NULL, y = "Employees") +
    theme_minimal(base_size = 10) +
    theme(
      plot.title         = element_text(face = "bold", color = "#3D2B6B", size = 11),
      panel.grid.minor   = element_blank(),
      panel.grid.major.x = element_blank()
    )

  plot_list[[cid]] <- p
}

gridExtra::grid.arrange(grobs = plot_list, nrow = 2,
                        top = "Task 8 (Bonus) — Auto-Generated KPI Tier Chart per Company")

Interpretation: Task 8 shows an automated report pipeline: a single generate_report() function holds the company-level logic, and a for loop runs it for every company with no copy-paste repetition. The CSV export and the plot grid follow the same pattern, so the reporting logic extends to more companies without being rewritten. In practice, a data team could refresh a portfolio-level HR report by re-running these loops, which is the core idea behind a reproducible data science workflow.

11. Summary & Conclusion

Complete Task Summary: Practicum Week-5

# ============================================================
# SUMMARY TABLE — All 8 tasks in one view
# ============================================================

summary_df <- data.frame(
  Task = c("Task 1","Task 2","Task 3","Task 4",
           "Task 5","Task 6","Task 7","Task 8"),
  Concept = c(
    "Dynamic Function + Input Validation",
    "Nested Simulation + Discount Logic",
    "5-Tier Categorization Loop",
    "Multi-Company Nested Loop Generation",
    "Monte Carlo Pi & Probability",
    "Loop Normalization & Feature Engineering",
    "KPI Dashboard — Mini Project",
    "Automated Report Generation (Bonus)"
  ),
  Key_Function = c(
    "compute_formula(x, formula)",
    "simulate_sales(n_sp, days) + apply_discount()",
    "categorize_performance(sales_amount)",
    "generate_company_data(n_co, n_emp)",
    "monte_carlo_pi(n_points)",
    "normalize_columns(df) + z_score(df)",
    "Integrated pipeline: Tasks 4–6 combined",
    "generate_report(cid, df) + for loop"
  ),
  Visualization = c(
    "Multi-line chart with log-y scale",
    "Cumulative line chart per salesperson",
    "Bar chart + pie chart of tier distribution",
    "Scatter with OLS regression per company",
    "Point plot with circle & sub-square overlay",
    "Faceted histogram + violin & boxplot",
    "Grouped bar + faceted scatter + area density",
    "Auto text reports + grid of bar charts"
  ),
  check.names = FALSE
)

kable(summary_df,
      caption = "Practicum Week-5 — Full Task Summary",
      col.names = c("Task","Concept","Key Function","Visualization")) |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE, font_size = 13)

Practicum Week-5 — Full Task Summary
Task	Concept	Key Function	Visualization
Task 1	Dynamic Function + Input Validation	compute_formula(x, formula)	Multi-line chart with log-y scale
Task 2	Nested Simulation + Discount Logic	simulate_sales(n_sp, days) + apply_discount()	Cumulative line chart per salesperson
Task 3	5-Tier Categorization Loop	categorize_performance(sales_amount)	Bar chart + pie chart of tier distribution
Task 4	Multi-Company Nested Loop Generation	generate_company_data(n_co, n_emp)	Scatter with OLS regression per company
Task 5	Monte Carlo Pi & Probability	monte_carlo_pi(n_points)	Point plot with circle & sub-square overlay
Task 6	Loop Normalization & Feature Engineering	normalize_columns(df) + z_score(df)	Faceted histogram + violin & boxplot
Task 7	KPI Dashboard — Mini Project	Integrated pipeline: Tasks 4–6 combined	Grouped bar + faceted scatter + area density
Task 8	Automated Report Generation (Bonus)	generate_report(cid, df) + for loop	Auto text reports + grid of bar charts

Key Conclusions

Practicum Week-5 demonstrated how functions, loops, simulation, and ggplot2 visualization fit together in R. Eight conclusions emerge:

Input validation (Task 1) makes functions robust: invalid inputs are caught at the call site instead of propagating silently through a pipeline.
Nested helper functions (Task 2) follow the single-responsibility principle: apply_discount() handles the business rule while simulate_sales() manages the loop structure.
Loop-based categorization (Task 3) adapts to different threshold configurations and is easier to maintain than hard-coded conditional chains.
Nested loops (Tasks 4 and 7) resemble the entity-then-record structure common in ETL jobs: an outer loop over entities (companies) and an inner loop over records (employees) produce a multi-level dataset.
Monte Carlo simulation (Task 5) illustrates that an estimate built from random samples approaches the true value as the number of samples grows; the same idea underlies Monte Carlo methods in probabilistic modeling and Bayesian inference.
Loop-based normalization (Task 6) applies the same feature scaling to any number of numeric columns, which keeps preprocessing reproducible.
The KPI Dashboard (Task 7) combines these skills in one analytics pipeline: data generation, loop classification, and multi-view visualization.
Automated reporting (Task 8) extends the pipeline: one function and one loop produce a multi-company HR report, with a CSV export alongside.

Together, these eight tasks show that reusable functions, structured loops, expressive visualization, and automated reporting form a core toolkit for data science work in R.
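The Monte Carlo conclusion can be checked numerically. A minimal sketch, assuming the usual quarter-circle estimator on the unit square; the helper name estimate_pi() is hypothetical and is not the practicum's monte_carlo_pi():

```r
# Hypothetical helper: 4 x the share of uniform points that fall inside
# the quarter circle of radius 1
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(2026)
n_grid    <- c(1e2, 1e3, 1e4, 1e5, 1e6)
estimates <- sapply(n_grid, estimate_pi)

# Estimate and absolute error for each sample size
data.frame(n_points  = n_grid,
           estimate  = estimates,
           abs_error = abs(estimates - pi))
```

The error shrinks roughly in proportion to 1/sqrt(n) on average, although for a single seed it does not have to decrease at every step.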

Practicum Week-5: FUNCTIONS & LOOPS + ILMU DATA

OCTAVIA MAIA REGO

2026-04-06
