FUNCTIONS & LOOPS + DATA SCIENCE

PRACTICUM WEEK 5

IDENTITY

FRENKHY TONGA RETANG

NIM: 52250005

Student in Data Science Programming

INSTITUT TEKNOLOGI SAINS BANDUNG (ITSB)

1. Introduction

Description

Practicum Week-5 focuses on building advanced functions and loops combined with real-world data science workflows in R. The following core competencies are developed:

Building multi-layer functions with nested loops and conditional logic.
Running multi-dataset simulations (sales, company, Monte Carlo).
Performing advanced statistics, data transformation, and ggplot2 visualizations.
Developing an automated data science workflow culminating in a KPI Dashboard.
Generating an automated HTML report per company (Bonus Task).

Functions

Reusable modular blocks of logic for any computation

Nested Loops

Loop inside loop for multi-dimensional iteration

Simulation

Generate synthetic datasets for statistical analysis

Transformation

Normalize and engineer features from raw data

ggplot2

Advanced multi-layer publication-ready visualizations

2. Required Libraries & Dataset Overview

Packages Used in This Practicum

All tasks rely on the following R packages. Make sure they are installed before knitting.

# ============================================================
# LIBRARIES — load all required packages
# ============================================================

library(ggplot2)   # Grammar of Graphics — all visualizations
library(dplyr)     # Data wrangling and transformation
library(tidyr)     # Data tidying (pivot, gather, spread)
library(scales)    # Axis formatting (comma, percent, etc.)
library(knitr)     # Table rendering in HTML output
library(kableExtra)# Enhanced HTML/LaTeX table styling
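
Because the report fails to knit if any of these packages is missing, a pre-flight check can be useful. The following is a minimal sketch using only base R; the names required_pkgs, is_installed, and missing_pkgs are illustrative and not part of the practicum code.

```r
# Hypothetical pre-flight check (not part of the original practicum code).
required_pkgs <- c("ggplot2", "dplyr", "tidyr", "scales", "knitr", "kableExtra")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
is_installed <- vapply(required_pkgs, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```

The install.packages() call is left commented out so that knitting never triggers a download without the author noticing.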

Dataset Strategy

This practicum uses simulated datasets generated within each task. Each one mimics a real-world data science scenario:

Task 1 — Mathematical x values (1:20) for formula comparison.
Task 2 — Sales simulation: 5 salespersons × 10 days.
Task 3 — 100 random sales values for performance categorization.
Task 4 — HR dataset: 3 companies × 20 employees each.
Task 5 — 5,000 random points for Monte Carlo Pi estimation.
Task 6 — Same HR data from Task 4, transformed and engineered.
Task 7 — Full KPI dashboard: 5 companies × 50 employees.
Task 8 — Automated company-level HTML report generation (Bonus).

# ============================================================
# DATASET OVERVIEW — preview structure for Task 4 / 6 / 7
# ============================================================

# Preview the structure of the company dataset that will be
# generated in Task 4 and reused in Tasks 6, 7, and 8.
# Columns: company_id, employee_id, salary, department,
#          performance_score, KPI_score, top_performer

preview_structure <- data.frame(
  Column            = c("company_id","employee_id","salary","department",
                        "performance_score","KPI_score","top_performer"),
  Type              = c("character","character","numeric","character",
                        "numeric","numeric","character"),
  Description       = c("Unique company identifier (C1–C5)",
                        "Unique employee identifier per company",
                        "Monthly salary (IDR 3,000–15,000)",
                        "Department name (HR/Finance/IT/Marketing/Ops)",
                        "Performance score (50–100)",
                        "KPI score (60–100)",
                        "Top performer flag: Yes if KPI > 90")
)

kable(preview_structure, caption = "Dataset Schema — Company HR Simulation") %>%
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE, font_size = 14)

Dataset Schema — Company HR Simulation
Column	Type	Description
company_id	character	Unique company identifier (C1–C5)
employee_id	character	Unique employee identifier per company
salary	numeric	Monthly salary (IDR 3,000–15,000)
department	character	Department name (HR/Finance/IT/Marketing/Ops)
performance_score	numeric	Performance score (50–100)
KPI_score	numeric	KPI score (60–100)
top_performer	character	Top performer flag: Yes if KPI > 90

3. Task 1 — Dynamic Multi-Formula Function

Overview

Build a function compute_formula(x, formula) that evaluates one of four mathematical formulas: linear, quadratic, cubic, or exponential. The function validates its input and computes the values in a loop; the results for all four formulas are then plotted on the same graph.

3.1 Function Definition & Computation

Task Requirements

Validate formula input — stop with a clear error message if invalid.
Use nested loops: outer loop iterates formulas, inner loop iterates x values.
Plot all four formulas on the same graph for x = 1:20.

# ============================================================
# TASK 1: Dynamic Multi-Formula Function
# compute_formula(x, formula) — returns a numeric vector
# Formulas: linear | quadratic | cubic | exponential
# ============================================================

compute_formula <- function(x, formula) {
  # Define allowed formula names
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")

  # Input validation: stop if formula is not recognized
  if (!(formula %in% valid_formulas)) {
    stop(paste("Invalid formula! Choose from:",
               paste(valid_formulas, collapse = ", ")))
  }

  # Pre-allocate result vector for efficiency
  result <- numeric(length(x))

  # Inner loop: compute formula value at each x[i]
  for (i in seq_along(x)) {
    if (formula == "linear") {
      result[i] <- 2 * x[i] + 3           # f(x) = 2x + 3

    } else if (formula == "quadratic") {
      result[i] <- x[i]^2 + 2 * x[i] + 1  # f(x) = x² + 2x + 1

    } else if (formula == "cubic") {
      result[i] <- x[i]^3 - 3 * x[i]^2 + 2 * x[i]  # f(x) = x³ - 3x² + 2x

    } else if (formula == "exponential") {
      result[i] <- exp(0.3 * x[i])         # f(x) = e^(0.3x)
    }
  }
  return(result)
}

# ---- Outer loop: iterate over all 4 formulas ----
x_vals   <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

results_list <- list()
for (f in formulas) {
  # Call compute_formula for each formula type
  results_list[[f]] <- data.frame(
    x       = x_vals,
    y       = compute_formula(x_vals, f),
    formula = f
  )
}

# Combine all formula results into one tidy data frame
df_formulas <- bind_rows(results_list)

# Show sample values for all 4 formulas at x = 1, 5, 10, 20
df_formulas %>%
  filter(x %in% c(1, 5, 10, 20)) %>%
  tidyr::pivot_wider(names_from = formula, values_from = y)

3.2 Visualization — All Formulas on One Graph

# ============================================================
# TASK 1: Plot — All 4 Formulas on the Same Graph
# Color-coded lines with points, log scale for visibility
# ============================================================

formula_colors <- c(
  "linear"      = "#4CAF50",
  "quadratic"   = "#2196F3",
  "cubic"       = "#E91E63",
  "exponential" = "#FF9800"
)

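# Drop non-positive values first: the cubic is exactly 0 at x = 1 and x = 2,
# and log10(0) is -Inf, which the log-scaled y axis cannot display.
ggplot(dplyr::filter(df_formulas, y > 0), aes(x = x, y = y, color = formula)) +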
  geom_line(linewidth = 1.5, alpha = 0.9) +          # main trend line
  geom_point(size = 2.8, alpha = 0.85) +             # individual data points
  scale_color_manual(values = formula_colors) +
  scale_y_log10(labels = comma) +                    # log scale to show all 4 on same canvas
  labs(
    title    = "Task 1 — Dynamic Multi-Formula Comparison",
    subtitle = "Linear · Quadratic · Cubic · Exponential (x = 1 to 20, log-y scale)",
    x        = "x value",
    y        = "f(x) — log scale",
    color    = "Formula Type",
    caption  = "Source: Practicum Week-5 — compute_formula()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 15, color = "#4A148C"),
    plot.subtitle    = element_text(color = "#6A1B9A", size = 10),
    legend.position  = "top",
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(fill = "#FAFAFA", color = NA)
  )

Interpretation Using a log-y scale lets all four formulas share one plot even though their values span several orders of magnitude. The linear formula (green) grows at a constant rate, f(x) = 2x + 3, and reaches 43 at x = 20. The quadratic (blue), f(x) = x² + 2x + 1 = (x + 1)², accelerates moderately and reaches 441. The cubic (pink) grows fastest on this range and reaches 6,840 at x = 20; because x³ - 3x² + 2x = x(x - 1)(x - 2), it is exactly zero at x = 1 and x = 2, and those two values cannot be drawn on a log scale. The exponential (orange), f(x) = e^(0.3x), stays below the linear formula until x = 10, overtakes it at x = 11, and reaches about 403 at x = 20, still just below the quadratic; it overtakes the quadratic only beyond the plotted range. The input validation ensures that an invalid formula name (e.g. "sqrt") raises an informative error instead of silently producing incorrect output.
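
Since arithmetic in R is vectorized, the element-wise loop is not strictly needed. As a sketch (compute_formula_vec is a hypothetical name, not part of the assignment), the same four formulas can be written with switch(), which also makes it easy to check the values at the right edge of the plot:

```r
# Hypothetical vectorized variant of compute_formula(); same formulas, no loop.
compute_formula_vec <- function(x, formula) {
  switch(formula,
    linear      = 2 * x + 3,
    quadratic   = x^2 + 2 * x + 1,
    cubic       = x^3 - 3 * x^2 + 2 * x,
    exponential = exp(0.3 * x),
    # Unnamed last argument is the default branch, reached for any other name
    stop("Invalid formula! Choose from: linear, quadratic, cubic, exponential")
  )
}

formulas <- c("linear", "quadratic", "cubic", "exponential")

# Value of each formula at x = 20
at_20 <- sapply(formulas, function(f) compute_formula_vec(20, f))
round(at_20, 2)
# linear 43, quadratic 441, cubic 6840, exponential 403.43

# The cubic factors as x(x - 1)(x - 2), so it is exactly 0 at x = 1 and x = 2
compute_formula_vec(1:3, "cubic")
# 0 0 6
```

The loop-based version remains the one the task asks for; this variant is only a cross-check on the numbers.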

4. Task 2 — Nested Simulation: Multi-Sales & Discounts

Overview

Build simulate_sales(n_salesperson, days) with a nested helper function for the discount logic. The outer loop iterates over salespersons and the inner loop over days. A conditional discount is applied to each sale based on its amount.

4.1 Simulation Function with Nested Loops

# ============================================================
# TASK 2: Nested Simulation — Multi-Sales & Discounts
# simulate_sales(n_salesperson, days) → tidy data frame
# Nested function: get_discount(amount) inside simulate_sales
# ============================================================

set.seed(42)  # for reproducibility

simulate_sales <- function(n_salesperson, days) {

  # ---- NESTED HELPER FUNCTION: discount rule ----
  # Returns discount rate based on sales amount thresholds
  get_discount <- function(amount) {
    if (amount > 800) {
      return(0.15)       # 15% discount for high-value sales
    } else if (amount > 500) {
      return(0.10)       # 10% for medium-value sales
    } else {
      return(0.05)       # 5% for low-value sales
    }
  }
  # ---- END NESTED FUNCTION ----

  records <- list()  # accumulate rows
  idx     <- 1       # flat index for list

  # OUTER LOOP: iterate over each salesperson
  # seq_len() yields an empty sequence when the count is 0, unlike 1:n
  for (sp in seq_len(n_salesperson)) {
    cumulative <- 0  # reset cumulative total per salesperson

    # INNER LOOP: iterate over each day for this salesperson
    for (d in seq_len(days)) {
      # Simulate random daily sales amount (200–1000)
      amount     <- round(runif(1, 200, 1000), 2)

      # Apply discount via nested helper function
      disc_rate  <- get_discount(amount)
      net_amount <- amount * (1 - disc_rate)

      # Track cumulative net sales for this salesperson
      cumulative <- cumulative + net_amount

      # Store each record
      records[[idx]] <- data.frame(
        sales_id       = sp,
        day            = d,
        sales_amount   = amount,
        discount_rate  = disc_rate,
        net_amount     = round(net_amount, 2),
        cumulative_net = round(cumulative, 2)
      )
      idx <- idx + 1
    }
  }

  return(bind_rows(records))
}

# Run: 5 salespersons, 10 days each
sales_df <- simulate_sales(n_salesperson = 5, days = 10)

# Show first 10 rows
head(sales_df, 10)

4.2 Summary Statistics per Salesperson

# ============================================================
# TASK 2: Summary statistics aggregated per salesperson
# ============================================================

sales_summary <- sales_df %>%
  group_by(sales_id) %>%
  summarise(
    Total_Sales_Amount = round(sum(sales_amount), 0),
    Total_Net_Revenue  = round(sum(net_amount), 0),
    Avg_Discount_Pct   = paste0(round(mean(discount_rate) * 100, 1), "%"),
    Max_Cumulative     = round(max(cumulative_net), 0),
    .groups = "drop"
  )

kable(sales_summary,
      caption = "Task 2 — Salesperson Performance Summary (5 SP × 10 Days)",
      col.names = c("Sales ID","Total Sales","Net Revenue","Avg Discount %","Max Cumulative")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Task 2 — Salesperson Performance Summary (5 SP × 10 Days)
Sales ID	Total Sales	Net Revenue	Avg Discount %	Max Cumulative
1	7090	6281	10.5%	6281
2	6720	5939	10.5%	5939
3	6923	6026	11.5%	6026
4	6154	5445	10%	5445
5	7067	6180	11.5%	6180

4.3 Cumulative Sales Plot

# ============================================================
# TASK 2: Line Chart — Cumulative Net Sales per Salesperson
# ============================================================

# Convert sales_id to labelled factor for better legend
sales_df$sales_id <- factor(sales_df$sales_id,
                             labels = paste0("SP-", 1:5))

ggplot(sales_df, aes(x = day, y = cumulative_net, color = sales_id)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3, shape = 21,
             aes(fill = sales_id), color = "white", stroke = 1.5) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title    = "Task 2 — Cumulative Net Sales per Salesperson (10 Days)",
    subtitle = "Discount rule: Amount >800 → 15% | >500 → 10% | else → 5%",
    x        = "Day",
    y        = "Cumulative Net Sales",
    color    = "Salesperson",
    fill     = "Salesperson",
    caption  = "Source: Practicum Week-5 — simulate_sales()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 15, color = "#01579B"),
    plot.subtitle    = element_text(color = "#0277BD", size = 10),
    legend.position  = "right",
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(fill = "#F0F8FF", color = NA)
  )

Interpretation The cumulative line chart shows net revenue for all 5 salespersons rising steadily over the 10-day period, since every day adds a positive amount. The gap between salespersons widens over time because each day's amount is an independent random draw and the differences accumulate. Sales above 800 receive the highest discount (15%), so a high gross total does not translate one-for-one into net revenue. In principle this can reorder the gross and net rankings, although in this run both rankings are identical (SP-1 leads with 7,090 gross and 6,281 net). The nested helper function get_discount() keeps the discount rule in one place and scoped to simulate_sales(), which makes the rule easy to change; the trade-off is that a nested function cannot be called or tested on its own from outside its parent.
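
The same simulation can be sketched without explicit loops. The version below is an illustration under assumptions, not the assignment's required design: get_discount_vec and sim are hypothetical names, the discount helper is defined at top level so it can be tested on its own, and the running total uses ave() with cumsum().

```r
# Hypothetical vectorized discount rule with the same thresholds as get_discount()
get_discount_vec <- function(amount) {
  ifelse(amount > 800, 0.15,
         ifelse(amount > 500, 0.10, 0.05))
}

# Boundary check: thresholds are strict, so exactly 500 and 800 fall in the lower tier
get_discount_vec(c(500, 500.01, 800, 800.01))
# 0.05 0.10 0.10 0.15

set.seed(42)
n_salesperson <- 5
days          <- 10

# One row per salesperson-day; 'day' varies fastest
sim <- expand.grid(day = seq_len(days), sales_id = seq_len(n_salesperson))
sim$sales_amount   <- round(runif(nrow(sim), 200, 1000), 2)
sim$discount_rate  <- get_discount_vec(sim$sales_amount)
sim$net_amount     <- sim$sales_amount * (1 - sim$discount_rate)
# Running total within each salesperson
sim$cumulative_net <- ave(sim$net_amount, sim$sales_id, FUN = cumsum)

head(sim, 3)
```

Keeping the helper at top level trades encapsulation for testability, which is the opposite choice from the nested design used in the task.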

5. Task 3 — Multi-Level Performance Categorization

Overview

Build categorize_performance(sales_amount) that assigns one of five categories: Excellent, Very Good, Good, Average, Poor. Output percentage breakdown, bar chart, and pie chart.

5.1 Categorization Function

# ============================================================
# TASK 3: Multi-Level Performance Categorization
# categorize_performance(x) → character vector of 5 tiers
# Uses a for loop + if-else chain over the full vector
# ============================================================

categorize_performance <- function(sales_amount) {
  # Pre-allocate output vector
  category <- character(length(sales_amount))

  # Loop through every element and assign a category
  for (i in seq_along(sales_amount)) {
    val <- sales_amount[i]

    if (val >= 900) {
      category[i] <- "Excellent"   # Top tier: amount >= 900

    } else if (val >= 700) {
      category[i] <- "Very Good"   # 700 ≤ amount < 900

    } else if (val >= 500) {
      category[i] <- "Good"        # 500 ≤ amount < 700

    } else if (val >= 300) {
      category[i] <- "Average"     # 300 ≤ amount < 500

    } else {
      category[i] <- "Poor"        # amount < 300 — lowest tier
    }
  }
  return(category)
}

# ---- Apply function to 100 simulated sales values ----
set.seed(42)
raw_sales <- round(runif(100, 100, 1000), 2)  # uniform random: 100–1000
categories <- categorize_performance(raw_sales)

# Build data frame with ordered factor for correct plot ordering
perf_df <- data.frame(
  sales_amount = raw_sales,
  category     = factor(categories,
                         levels = c("Poor","Average","Good","Very Good","Excellent"))
)

# Calculate count and percentage per category
perf_pct <- perf_df %>%
  count(category) %>%
  mutate(pct = round(n / sum(n) * 100, 1),
         label = paste0(n, " (", pct, "%)"))

kable(perf_pct, caption = "Task 3 — Category Distribution (n=100 sales records)",
      col.names = c("Category","Count","Percentage (%)","Label")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Task 3 — Category Distribution (n=100 sales records)
Category	Count	Percentage (%)	Label
Poor	22	22	22 (22%)
Average	18	18	18 (18%)
Good	21	21	21 (21%)
Very Good	23	23	23 (23%)
Excellent	16	16	16 (16%)

5.2 Bar Chart — Category Distribution

# ============================================================
# TASK 3: Bar Chart — performance category frequency
# ============================================================

cat_colors <- c("Poor"="#EF9A9A","Average"="#FFCC80",
                 "Good"="#A5D6A7","Very Good"="#81D4FA","Excellent"="#CE93D8")

ggplot(perf_pct, aes(x = category, y = n, fill = category)) +
  geom_col(width = 0.65, show.legend = FALSE, alpha = 0.9) +
  geom_text(aes(label = label), vjust = -0.5,
            fontface = "bold", size = 4, color = "#333333") +
  scale_fill_manual(values = cat_colors) +
  labs(
    title    = "Task 3 — Performance Category Distribution (Bar Chart)",
    subtitle = "Based on 100 simulated sales values (uniform: 100–1000)",
    x        = "Performance Category",
    y        = "Number of Records",
    caption  = "Source: Practicum Week-5 — categorize_performance()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title         = element_text(face = "bold", size = 14, color = "#004D40"),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank()
  )

5.3 Pie Chart — Proportion View

# ============================================================
# TASK 3: Pie Chart — proportion by category
# ============================================================

ggplot(perf_pct, aes(x = "", y = n, fill = category)) +
  geom_col(width = 1, color = "white", linewidth = 0.9) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(category, "\n", pct, "%")),
            position = position_stack(vjust = 0.5),
            fontface = "bold", size = 3.8, color = "#1A1A1A") +
  scale_fill_manual(values = cat_colors) +
  labs(
    title   = "Task 3 — Performance Category Distribution (Pie Chart)",
    fill    = "Category",
    caption = "Source: Practicum Week-5 — categorize_performance()"
  ) +
  theme_void() +
  theme(
    plot.title      = element_text(face = "bold", size = 14,
                                   hjust = 0.5, color = "#004D40"),
    legend.position = "right"
  )

Interpretation The input is drawn from a uniform distribution between 100 and 1000, a range of 900 units. The Poor, Average, Good, and Very Good tiers each span 200 units, so each is expected to hold about 22.2% of records (200/900), while Excellent spans only 100 units (900 to 1000) and is expected to hold about 11.1%. The observed shares (22%, 18%, 21%, 23%, and 16%) are consistent with these expectations given the sampling noise of only 100 records, which supports the logic of the categorize_performance() loop. In a real sales scenario with a skewed distribution, the chart would reveal which tier is most concentrated, guiding incentive program design.
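
The expected share of each tier follows directly from the uniform distribution and the tier boundaries. As a sketch (categorize_cut, tier_breaks, tier_labels, and edges are hypothetical names), cut() reproduces the if-else chain in vectorized form and punif() gives the theoretical shares:

```r
tier_breaks <- c(-Inf, 300, 500, 700, 900, Inf)
tier_labels <- c("Poor", "Average", "Good", "Very Good", "Excellent")

# right = FALSE makes intervals closed on the left: [300, 500), [500, 700), ...
categorize_cut <- function(sales_amount) {
  cut(sales_amount, breaks = tier_breaks, labels = tier_labels, right = FALSE)
}

as.character(categorize_cut(c(299.99, 300, 899.99, 900)))
# "Poor" "Average" "Very Good" "Excellent"

# Theoretical share of each tier for amounts drawn from Uniform(100, 1000)
edges <- c(100, 300, 500, 700, 900, 1000)
expected_share <- diff(punif(edges, min = 100, max = 1000))
names(expected_share) <- tier_labels
round(expected_share * 100, 1)
# Poor 22.2, Average 22.2, Good 22.2, Very Good 22.2, Excellent 11.1
```

Comparing these shares with the observed counts is a quick sanity check on both the thresholds and the random input range.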

6. Task 4 — Multi-Company Dataset Simulation

Overview

Build generate_company_data(n_company, n_employees) using nested loops (outer: company, inner: employee). Conditional KPI logic flags top performers. Output includes a summary table and scatter visualization.

6.1 Data Generation with Nested Loops

# ============================================================
# TASK 4: Multi-Company Dataset Simulation
# generate_company_data(n_company, n_employees) → data frame
# Nested loops: outer = company, inner = employee
# Conditional logic: KPI > 90 → top_performer = "Yes"
# ============================================================

set.seed(123)  # fixed seed for reproducibility

generate_company_data <- function(n_company, n_employees) {
  # Possible department names
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")

  records <- list()   # accumulate all employee records
  idx     <- 1        # flat row index

  # OUTER LOOP: iterate over each company
  for (comp in seq_len(n_company)) {   # named 'comp' so it is not confused with base::c()

    # INNER LOOP: iterate over each employee in this company
    for (e in seq_len(n_employees)) {
      # Generate random numeric attributes
      salary      <- round(runif(1, 3000, 15000), 0)  # monthly salary
      dept        <- sample(departments, 1)            # random department
      perf_score  <- round(runif(1, 50, 100), 1)      # performance score
      kpi_score   <- round(runif(1, 60, 100), 1)      # KPI score

      # CONDITIONAL: flag top performers (KPI > 90)
      top_performer <- ifelse(kpi_score > 90, "Yes", "No")

      # Save record
      records[[idx]] <- data.frame(
        company_id        = paste0("C", comp),
        employee_id       = paste0("C", comp, "_E", e),
        salary            = salary,
        department        = dept,
        performance_score = perf_score,
        KPI_score         = kpi_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
      idx <- idx + 1
    }
  }
  return(bind_rows(records))
}

# Generate: 3 companies, 20 employees each → 60 rows total
company_df <- generate_company_data(n_company = 3, n_employees = 20)

# Show first 8 rows
head(company_df, 8)

6.2 Company Summary Table

# ============================================================
# TASK 4: Company-level aggregation using dplyr
# ============================================================

company_summary <- company_df %>%
  group_by(company_id) %>%
  summarise(
    Avg_Salary      = formatC(round(mean(salary), 0), big.mark=",", format="d"),
    Avg_Performance = round(mean(performance_score), 2),
    Max_KPI         = max(KPI_score),
    Top_Performers  = sum(top_performer == "Yes"),  # conditional count
    .groups         = "drop"
  )

kable(company_summary,
      caption = "Task 4 — Company Summary (3 Companies × 20 Employees)",
      col.names = c("Company","Avg Salary","Avg Performance","Max KPI","Top Performers")) %>%
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE)

Task 4 — Company Summary (3 Companies × 20 Employees)
Company	Avg Salary	Avg Performance	Max KPI	Top Performers
C1	8,147	76.97	97.6	5
C2	8,937	75.44	99.4	6
C3	8,628	68.32	98.9	7

6.3 Visualization — KPI vs Salary by Company

# ============================================================
# TASK 4: Scatter Plot — Salary vs KPI, color by company
# Shape encodes top performer status (KPI > 90)
# ============================================================

ggplot(company_df, aes(x = KPI_score, y = salary,
                        color = company_id, shape = top_performer)) +
  geom_point(size = 3.5, alpha = 0.85) +
  # Regression line per company (no standard error band for clarity)
  geom_smooth(method = "lm", se = FALSE, linewidth = 1,
              aes(group = company_id)) +
  scale_color_manual(values = c("C1"="#E91E63","C2"="#2196F3","C3"="#4CAF50")) +
  scale_shape_manual(values = c("Yes"=17, "No"=16),
                     labels = c("Yes"="Top Performer (KPI>90)","No"="Regular")) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Task 4 — Salary vs KPI Score by Company",
    subtitle = "Triangle = Top Performer (KPI > 90) | Lines = Linear trend per company",
    x        = "KPI Score",
    y        = "Monthly Salary",
    color    = "Company",
    shape    = "Status",
    caption  = "Source: Practicum Week-5 — generate_company_data()"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#BF360C"),
    plot.subtitle    = element_text(color = "#D84315", size = 10),
    legend.position  = "right",
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(fill = "#FFF8F5", color = NA)
  )

Interpretation The scatter plot shows no strong relationship between salary and KPI score in this simulated dataset: the three regression lines are nearly flat. This is expected, because salary and KPI were generated independently, so any slope is sampling noise from only 20 employees per company. Top performers (triangles, KPI > 90) are spread across all salary levels for the same reason. Since the pattern is built into the simulation, it says nothing about real organizations. With real HR data, a similarly flat trend would indicate that pay is not tied to KPI (for example, where pay scales are fixed by grade rather than performance) and could prompt HR leaders to review the compensation structure.
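
The independence of salary and KPI can be checked numerically with a per-company correlation. The sketch below is self-contained and therefore generates its own data with a hypothetical vectorized generator (generate_company_data_vec); because it draws random numbers in a different order than the nested loops, its values will not match the tables above.

```r
# Hypothetical vectorized generator; same columns and ranges as generate_company_data()
generate_company_data_vec <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")
  n        <- n_company * n_employees
  company  <- rep(seq_len(n_company), each = n_employees)
  employee <- rep(seq_len(n_employees), times = n_company)
  kpi      <- round(runif(n, 60, 100), 1)

  data.frame(
    company_id        = paste0("C", company),
    employee_id       = paste0("C", company, "_E", employee),
    salary            = round(runif(n, 3000, 15000), 0),
    department        = sample(departments, n, replace = TRUE),
    performance_score = round(runif(n, 50, 100), 1),
    KPI_score         = kpi,
    top_performer     = ifelse(kpi > 90, "Yes", "No"),
    stringsAsFactors  = FALSE
  )
}

set.seed(123)
demo_df <- generate_company_data_vec(n_company = 3, n_employees = 20)

# Pearson correlation between salary and KPI within each company;
# values scattered around 0 are expected because the columns are drawn independently
cor_by_company <- sapply(split(demo_df, demo_df$company_id),
                         function(d) cor(d$salary, d$KPI_score))
round(cor_by_company, 2)
```

With only 20 employees per company, correlations of a few tenths in either direction can still arise by chance, so a single non-zero value should not be over-read.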

7. Task 5 — Monte Carlo Simulation: Pi & Probability

Overview

Build monte_carlo_pi(n_points) that estimates π by checking whether random points in the square [-1, 1] × [-1, 1] fall inside the unit circle. A secondary analysis estimates the probability that a point lands in the central sub-square [-0.5, 0.5] × [-0.5, 0.5].

7.1 Monte Carlo Function

# ============================================================
# TASK 5: Monte Carlo Simulation — Pi Estimation
# monte_carlo_pi(n_points) → list with pi estimate, prob, coords
# Uses a for loop over n_points iterations
# ============================================================

set.seed(7)  # fixed seed for reproducibility

monte_carlo_pi <- function(n_points) {
  # Generate n random points in the square [-1, 1] x [-1, 1]
  x_pts <- runif(n_points, -1, 1)
  y_pts <- runif(n_points, -1, 1)

  # Initialize tracking vectors
  in_circle    <- numeric(n_points)  # 1 if inside unit circle
  in_subsquare <- numeric(n_points)  # 1 if inside sub-square [-0.5,0.5]^2

  # FOR LOOP: check each point
  for (i in seq_len(n_points)) {

    # Check if point (x, y) is inside the unit circle (r = 1)
    # Condition: x² + y² ≤ 1
    if (x_pts[i]^2 + y_pts[i]^2 <= 1) {
      in_circle[i] <- 1
    }

    # Check if point is inside the sub-square [-0.5, 0.5]
    # Area of sub-square = 1×1 = 1, out of full square 2×2 = 4
    # Expected probability ≈ 1/4 = 0.25
    if (abs(x_pts[i]) <= 0.5 && abs(y_pts[i]) <= 0.5) {
      in_subsquare[i] <- 1
    }
  }

  # Estimate π: area ratio × 4 = π
  pi_estimate    <- 4 * sum(in_circle) / n_points
  prob_subsquare <- sum(in_subsquare) / n_points

  return(list(
    pi_estimate    = pi_estimate,
    prob_subsquare = prob_subsquare,
    x              = x_pts,
    y              = y_pts,
    in_circle      = in_circle
  ))
}

# Run Monte Carlo with 5,000 random points
mc_result <- monte_carlo_pi(5000)

# Print results with comparison to true π
cat("=== Monte Carlo Simulation Results (n = 5,000) ===\n")

=== Monte Carlo Simulation Results (n = 5,000) ===

cat(sprintf("Estimated Pi        : %.6f\n", mc_result$pi_estimate))

Estimated Pi        : 3.131200

cat(sprintf("True Pi (R built-in): %.6f\n", pi))

True Pi (R built-in): 3.141593

cat(sprintf("Absolute Error      : %.6f\n", abs(mc_result$pi_estimate - pi)))

Absolute Error      : 0.010393

cat(sprintf("Error Percentage    : %.4f%%\n", abs(mc_result$pi_estimate - pi)/pi*100))

Error Percentage    : 0.3308%

cat(sprintf("\nSub-square Prob     : %.4f\n", mc_result$prob_subsquare))


Sub-square Prob     : 0.2592

cat(sprintf("Expected Prob (1/4) : 0.2500\n"))

Expected Prob (1/4) : 0.2500

7.2 Visualization — Points Inside vs Outside Circle

# ============================================================
# TASK 5: Scatter plot — visualize Monte Carlo sampling
# Green = inside circle, red = outside
# Blue circle overlay = theoretical boundary
# Orange dashed rectangle = sub-square region
# ============================================================

mc_plot_df <- data.frame(
  x         = mc_result$x,
  y         = mc_result$y,
  in_circle = factor(mc_result$in_circle,
                      levels = c(0, 1),
                      labels = c("Outside Circle", "Inside Circle"))
)

ggplot(mc_plot_df, aes(x = x, y = y, color = in_circle)) +
  geom_point(size = 0.55, alpha = 0.55) +
  # Theoretical unit circle overlay
  annotate("path",
           x = cos(seq(0, 2 * pi, length.out = 300)),
           y = sin(seq(0, 2 * pi, length.out = 300)),
           color = "#1A237E", linewidth = 1.2) +
  # Sub-square region overlay (dashed orange)
  annotate("rect", xmin = -0.5, xmax = 0.5, ymin = -0.5, ymax = 0.5,
           color = "#E65100", fill = NA, linewidth = 1.1, linetype = "dashed") +
  scale_color_manual(values = c("Outside Circle" = "#EF9A9A",
                                 "Inside Circle"  = "#66BB6A")) +
  coord_fixed() +
  labs(
    title    = "Task 5 — Monte Carlo Pi Estimation (n = 5,000 points)",
    subtitle = paste0("Estimated π = ", round(mc_result$pi_estimate, 5),
                      "   |   Sub-square hit rate = ",
                      round(mc_result$prob_subsquare, 4)),
    x        = "x coordinate",
    y        = "y coordinate",
    color    = "Point Position",
    caption  = "Blue = unit circle | Orange dashed = sub-square [-0.5, 0.5]²"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#1B5E20"),
    plot.subtitle    = element_text(color = "#2E7D32", size = 10),
    legend.position  = "bottom",
    panel.grid.minor = element_blank()
  )

Interpretation With 5,000 random points, the Monte Carlo method estimates π as 3.1312, an absolute error of about 0.0104 (0.33%). The geometric intuition: the area of the unit circle is π and the enclosing 2×2 square has area 4, so the proportion of points inside the circle converges to π/4 as n → ∞; multiplying by 4 gives the π estimate. The sub-square (orange dashed) captures 25.92% of the points, close to its theoretical area fraction (1×1 out of 2×2 = 0.25). The error of a Monte Carlo estimate shrinks in proportion to 1/√n, so each additional decimal digit of accuracy requires roughly 100 times more points: accuracy is bought with computation time.
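
The size of the sampling error can be estimated from the data itself. The sketch below (monte_carlo_pi_vec is a hypothetical name) is a vectorized variant that also returns the binomial standard error of the estimate, 4·sqrt(p(1 - p)/n), where p is the observed fraction of points inside the circle.

```r
# Hypothetical vectorized variant of monte_carlo_pi() with a standard error
monte_carlo_pi_vec <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)

  # Fraction of points inside the unit circle (estimates pi / 4)
  p_hat <- mean(x^2 + y^2 <= 1)

  list(
    pi_estimate = 4 * p_hat,
    std_error   = 4 * sqrt(p_hat * (1 - p_hat) / n_points)
  )
}

set.seed(7)
for (n in c(500, 5000, 50000)) {
  est <- monte_carlo_pi_vec(n)
  cat(sprintf("n = %6d   pi = %.4f   std. error = %.4f\n",
              n, est$pi_estimate, est$std_error))
}
```

For n = 5,000 the standard error is about 0.023, so the absolute error of 0.0104 reported above lies well within one standard error. Each tenfold increase in n reduces the standard error by a factor of about 3.2 (√10).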

8. Task 6 — Advanced Data Transformation & Feature Engineering

Overview

Build normalize_columns(df) and z_score(df) using loop-based iteration over column names. Then engineer two new categorical features. Compare distributions before and after transformation using a histogram (faceted) and a violin + boxplot.

8.1 Transformation Functions

# ============================================================
# TASK 6: Loop-based Normalization & Standardization
# normalize_columns(df) → Min-Max to [0, 1]
# z_score(df)           → Zero mean, unit standard deviation
# Both iterate over column names with a for loop
# ============================================================

# ---- Function 1: Min-Max Normalization ----
normalize_columns <- function(df) {
  # Identify numeric columns automatically
  num_cols <- names(df)[sapply(df, is.numeric)]
  result   <- df

  # Loop over each numeric column
  for (col in num_cols) {
    min_val       <- min(df[[col]], na.rm = TRUE)
    max_val       <- max(df[[col]], na.rm = TRUE)

    # Apply Min-Max formula: (x - min) / (max - min)
    result[[col]] <- (df[[col]] - min_val) / (max_val - min_val)
  }
  return(result)
}

# ---- Function 2: Z-Score Standardization ----
z_score <- function(df) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  result   <- df

  for (col in num_cols) {
    mu            <- mean(df[[col]], na.rm = TRUE)  # column mean
    sigma         <- sd(df[[col]],   na.rm = TRUE)  # column std dev

    # Apply Z-score formula: (x - μ) / σ
    result[[col]] <- (df[[col]] - mu) / sigma
  }
  return(result)
}

# ---- Apply to numeric columns from company dataset (Task 4) ----
company_num    <- company_df %>% select(salary, performance_score, KPI_score)

company_norm   <- normalize_columns(company_num)  # Min-Max
company_zscore <- z_score(company_num)             # Z-Score

# Compare summary statistics
cat("--- ORIGINAL ---\n")

--- ORIGINAL ---

summary(company_num)

     salary      performance_score   KPI_score    
 Min.   : 3505   Min.   :50.30     Min.   :60.00  
 1st Qu.: 5170   1st Qu.:63.20     1st Qu.:73.15  
 Median : 8430   Median :70.55     Median :85.35  
 Mean   : 8571   Mean   :73.58     Mean   :82.64  
 3rd Qu.:11562   3rd Qu.:82.92     3rd Qu.:91.85  
 Max.   :14609   Max.   :99.70     Max.   :99.40

cat("\n--- MIN-MAX NORMALIZED [0,1] ---\n")


--- MIN-MAX NORMALIZED [0,1] ---

summary(company_norm)

     salary       performance_score   KPI_score     
 Min.   :0.0000   Min.   :0.0000    Min.   :0.0000  
 1st Qu.:0.1499   1st Qu.:0.2611    1st Qu.:0.3338  
 Median :0.4436   Median :0.4099    Median :0.6434  
 Mean   :0.4562   Mean   :0.4712    Mean   :0.5746  
 3rd Qu.:0.7256   3rd Qu.:0.6604    3rd Qu.:0.8084  
 Max.   :1.0000   Max.   :1.0000    Max.   :1.0000

cat("\n--- Z-SCORE STANDARDIZED (mean=0, sd=1) ---\n")


--- Z-SCORE STANDARDIZED (mean=0, sd=1) ---

summary(company_zscore)

     salary         performance_score   KPI_score      
 Min.   :-1.44752   Min.   :-1.7673   Min.   :-1.9302  
 1st Qu.:-0.97188   1st Qu.:-0.7878   1st Qu.:-0.8091  
 Median :-0.04004   Median :-0.2297   Median : 0.2310  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.85473   3rd Qu.: 0.7100   3rd Qu.: 0.7852  
 Max.   : 1.72549   Max.   : 1.9837   Max.   : 1.4289
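Both functions above divide by a spread term (the range for Min-Max, the standard deviation for Z-score), so a column in which every value is identical would yield NaN. The sketch below shows one possible guard for the Min-Max case. The helper name safe_min_max and the toy data frame are assumptions for illustration and are not part of the practicum code; it also assumes each column has at least one non-missing value.

```r
# Hypothetical helper (not in the original practicum code):
# min-max scaling that guards against zero-range (constant) columns.
safe_min_max <- function(x) {
  rng  <- range(x, na.rm = TRUE)
  span <- rng[2] - rng[1]
  if (span == 0) {
    # Constant column: nothing to rescale, so return 0 (NA stays NA)
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / span
}

# Toy data: one ordinary column and one constant column
toy    <- data.frame(salary = c(4000, 8000, 12000), grade = c(3, 3, 3))
scaled <- as.data.frame(lapply(toy, safe_min_max))
print(scaled)
```

Returning 0 for a constant column is only one convention; returning NA or dropping the column would be equally defensible depending on the downstream model.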

8.2 Feature Engineering

# ============================================================
# TASK 6: Feature Engineering — create 2 new categorical columns
# performance_category: Low / Medium / High (from performance_score)
# salary_bracket:       Entry / Mid / Top (from salary)
# ============================================================

company_df <- company_df %>%
  mutate(
    # New feature 1: performance tier based on score
    performance_category = case_when(
      performance_score >= 85 ~ "High",
      performance_score >= 70 ~ "Medium",
      TRUE                    ~ "Low"
    ),
    # New feature 2: salary bracket
    salary_bracket = case_when(
      salary >= 12000 ~ "Top",
      salary >= 7000  ~ "Mid",
      TRUE            ~ "Entry"
    )
  )

# Distribution summary of new features
cat("=== New Feature: performance_category ===\n")

=== New Feature: performance_category ===

print(table(company_df$performance_category))


  High    Low Medium 
    14     28     18

cat("\n=== New Feature: salary_bracket ===\n")


=== New Feature: salary_bracket ===

print(table(company_df$salary_bracket))


Entry   Mid   Top 
   23    23    14
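The same threshold rules can also be expressed with base R's cut(), which bins a numeric vector and returns an ordered set of factor levels in one step. This is a minimal sketch on made-up scores, not the practicum data; the break points 70 and 85 are taken from the case_when() rules above.

```r
# Illustrative scores only (chosen to sit on and around the thresholds)
scores <- c(50.3, 69.9, 70, 84.9, 85, 99.7)

perf_cat <- cut(
  scores,
  breaks = c(-Inf, 70, 85, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE   # intervals are [lower, upper), matching the >= rules
)

print(table(perf_cat))
```

Because cut() returns a factor whose levels follow the order of labels, the later factor(..., levels = c("Low","Medium","High")) step would not be needed with this variant.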

8.3 Distribution Comparison — Histograms (Faceted)

# ============================================================
# TASK 6: Histogram Comparison — salary before & after transform
# Faceted into 3 panels: Original | Min-Max | Z-Score
# ============================================================

# Stack all three salary distributions into one tidy data frame
hist_df <- bind_rows(
  data.frame(value = company_num$salary,    type = "1. Original"),
  data.frame(value = company_norm$salary,   type = "2. Min-Max Normalized"),
  data.frame(value = company_zscore$salary, type = "3. Z-Score Standardized")
)

ggplot(hist_df, aes(x = value, fill = type)) +
  geom_histogram(bins = 18, color = "white", alpha = 0.88) +
  facet_wrap(~type, scales = "free_x", nrow = 1) +
  scale_fill_manual(values = c("1. Original"            = "#90CAF9",
                                "2. Min-Max Normalized"  = "#A5D6A7",
                                "3. Z-Score Standardized"= "#CE93D8")) +
  labs(
    title   = "Task 6 — Salary Distribution: Before vs After Transformation",
    subtitle = "Shape is identical across all 3 — only the scale changes",
    x       = "Value",
    y       = "Count",
    caption = "Source: Practicum Week-5 — normalize_columns() & z_score()"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title       = element_text(face = "bold", size = 13, color = "#4A148C"),
    plot.subtitle    = element_text(color = "#6A1B9A", size = 10),
    legend.position  = "none",
    strip.text       = element_text(face = "bold", size = 10),
    panel.grid.minor = element_blank()
  )

8.4 Violin + Boxplot — Salary by Performance Category

# ============================================================
# TASK 6: Violin + Boxplot — shows full distribution shape
# Violins reveal the density, boxplots show median & IQR
# ============================================================

company_df$performance_category <- factor(company_df$performance_category,
                                           levels = c("Low","Medium","High"))

ggplot(company_df, aes(x = performance_category, y = salary,
                        fill = performance_category)) +
  # Violin: shows density/distribution shape
  geom_violin(alpha = 0.55, trim = FALSE) +
  # Boxplot overlay: shows median, IQR, and outliers
  geom_boxplot(width = 0.18, alpha = 0.85, outlier.size = 3,
               outlier.color = "#B71C1C", color = "#333333") +
  # Individual data points
  geom_jitter(width = 0.06, alpha = 0.45, size = 2, color = "#455A64") +
  scale_fill_manual(values = c("Low"="#FFCDD2","Medium"="#FFF9C4","High"="#C8E6C9")) +
  scale_y_continuous(labels = comma) +
  labs(
    title   = "Task 6 — Salary Distribution by Performance Category",
    subtitle = "Violin = distribution shape | Box = median & IQR | Points = individual employees",
    x       = "Performance Category",
    y       = "Monthly Salary",
    fill    = "Category",
    caption = "Source: Practicum Week-5 — Feature Engineering"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#4A148C"),
    plot.subtitle    = element_text(color = "#6A1B9A", size = 10),
    legend.position  = "none",
    panel.grid.minor = element_blank()
  )

Interpretation

The faceted histograms confirm the key principle of both transformations: each is a linear rescaling, so the shape of the distribution is preserved and only the scale of the x-axis changes. Min-Max normalization maps all values to [0, 1], making columns directly comparable. Z-Score standardization centers the data at 0 with unit variance, which matters for distance-based machine learning algorithms where features on larger scales would otherwise dominate. The violin and boxplot overlay shows that salary ranges are broadly similar across all three performance categories in this simulated dataset, a pattern that would warrant a compensation review in a real HR context.
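The shape-preservation claim can be checked numerically: because Min-Max and Z-score are both linear transformations, each should correlate perfectly with the original values, and standardizing the Min-Max output should reproduce the Z-scores. The sketch below uses a made-up right-skewed vector as a stand-in for salary; the seed and distribution are arbitrary assumptions.

```r
set.seed(1)                          # arbitrary seed for the toy data
x <- rexp(200, rate = 1 / 8000)      # right-skewed stand-in for salary

min_max <- (x - min(x)) / (max(x) - min(x))
z       <- (x - mean(x)) / sd(x)

# A linear rescaling keeps a perfect linear relationship with the original
cor_mm <- cor(x, min_max)
cor_z  <- cor(x, z)

# Standardizing the min-max values gives back the same z-scores
same_z <- isTRUE(all.equal(z, (min_max - mean(min_max)) / sd(min_max)))

print(c(cor_mm = cor_mm, cor_z = cor_z))
print(same_z)
```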

9. Task 7 — Mini Project: Company KPI Dashboard

Overview

Generate a full dataset for 5 companies × 50 employees each (250 rows total). Use a loop to categorize employees into 4 KPI tiers. Output: summary table, grouped bar chart, scatter with regression, department analysis, and salary distribution.

9.1 Generate Full Dashboard Dataset

# ============================================================
# TASK 7: Full KPI Dashboard Dataset — 5 companies × 50 employees
# Reuses generate_company_data() from Task 4
# Adds KPI tier classification via a for loop
# ============================================================

set.seed(99)
dashboard_df <- generate_company_data(n_company = 5, n_employees = 50)

# ---- FOR LOOP: classify each employee into a KPI tier ----
kpi_tier <- character(nrow(dashboard_df))  # pre-allocate

for (i in seq_len(nrow(dashboard_df))) {
  kpi <- dashboard_df$KPI_score[i]

  if (kpi >= 90) {
    kpi_tier[i] <- "Elite"       # KPI score of 90 or above

  } else if (kpi >= 75) {
    kpi_tier[i] <- "Strong"      # solid performers

  } else if (kpi >= 60) {
    kpi_tier[i] <- "Developing"  # on track but room to grow

  } else {
    kpi_tier[i] <- "At Risk"     # needs intervention
  }
}

# Attach tier as ordered factor
dashboard_df$kpi_tier <- factor(kpi_tier,
                                 levels = c("At Risk","Developing","Strong","Elite"))

cat(sprintf("Dataset: %d employees across %d companies\n",
            nrow(dashboard_df), length(unique(dashboard_df$company_id))))

Dataset: 250 employees across 5 companies

head(dashboard_df, 6)
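The task asks for an explicit loop, but the same tiering can be cross-checked against a vectorized version. The sketch below, on made-up scores, compares an if / else if loop with base R's findInterval(), which returns how many thresholds each value has reached. The variable names are illustrative assumptions.

```r
# Illustrative scores only, placed on and around each threshold
kpi         <- c(59.9, 60, 74.9, 75, 89.9, 90, 99.4)
tier_labels <- c("At Risk", "Developing", "Strong", "Elite")

# Loop version, mirroring the if / else if chain
tier_loop <- character(length(kpi))
for (i in seq_along(kpi)) {
  tier_loop[i] <- if (kpi[i] >= 90) {
    "Elite"
  } else if (kpi[i] >= 75) {
    "Strong"
  } else if (kpi[i] >= 60) {
    "Developing"
  } else {
    "At Risk"
  }
}

# Vectorized version: findInterval() returns 0..3 (thresholds reached),
# so adding 1 turns it into an index into tier_labels
tier_vec <- tier_labels[findInterval(kpi, c(60, 75, 90)) + 1]

print(data.frame(kpi, tier_loop, tier_vec))
```

findInterval() treats each interval as closed on the left by default, which is what makes it agree with the >= comparisons in the loop.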

9.2 Summary per Company

# ============================================================
# TASK 7: Aggregate KPI dashboard summary per company
# ============================================================

dashboard_summary <- dashboard_df %>%
  group_by(company_id) %>%
  summarise(
    Avg_Salary       = formatC(round(mean(salary), 0), big.mark=",", format="d"),
    Avg_KPI          = round(mean(KPI_score), 2),
    Top_Performers   = sum(top_performer == "Yes"),
    Elite_Count      = sum(kpi_tier == "Elite"),
    Developing_Count = sum(kpi_tier == "Developing"),
    At_Risk_Count    = sum(kpi_tier == "At Risk"),
    .groups          = "drop"
  )

kable(dashboard_summary,
      caption = "Task 7 — KPI Dashboard: Company Summary (5 Companies × 50 Employees)",
      col.names = c("Company","Avg Salary","Avg KPI","Top Performers",
                    "Elite","Developing","At Risk")) %>%
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE)

Task 7 — KPI Dashboard: Company Summary (5 Companies × 50 Employees)
Company	Avg Salary	Avg KPI	Top Performers	Elite	Developing	At Risk
C1	8,617	82.73	15	15	14	0
C2	8,928	80.05	14	14	20	0
C3	9,304	79.57	11	11	19	0
C4	8,564	79.70	10	10	17	0
C5	8,492	80.15	11	11	14	0

9.3 Grouped Bar Chart — KPI Tiers per Company

# ============================================================
# TASK 7: Grouped Bar — employee count per KPI tier, per company
# ============================================================

tier_dist <- dashboard_df %>%
  count(company_id, kpi_tier) %>%
  group_by(company_id) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

tier_colors <- c("At Risk"    = "#EF9A9A",
                  "Developing" = "#FFCC80",
                  "Strong"     = "#81D4FA",
                  "Elite"      = "#CE93D8")

ggplot(tier_dist, aes(x = company_id, y = n, fill = kpi_tier)) +
  geom_col(position = "dodge", width = 0.72, alpha = 0.9) +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_dodge(width = 0.72),
            vjust = -0.5, size = 3.2, fontface = "bold") +
  scale_fill_manual(values = tier_colors) +
  labs(
    title    = "Task 7 — KPI Tier Distribution per Company",
    subtitle = "Elite (≥90) · Strong (75–89) · Developing (60–74) · At Risk (<60)",
    x        = "Company",
    y        = "Number of Employees",
    fill     = "KPI Tier",
    caption  = "Source: Practicum Week-5 — KPI Dashboard (n=250)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title         = element_text(face = "bold", size = 14, color = "#8C6900"),
    plot.subtitle      = element_text(color = "#A07800", size = 10),
    legend.position    = "top",
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank()
  )

9.4 Scatter with Regression — Salary vs KPI (Faceted)

# ============================================================
# TASK 7: Faceted Scatter — Salary vs KPI per company
# Regression line + 95% CI band per facet
# ============================================================

comp_colors <- c("C1"="#E91E63","C2"="#2196F3","C3"="#4CAF50",
                  "C4"="#FF9800","C5"="#9C27B0")

ggplot(dashboard_df, aes(x = KPI_score, y = salary, color = company_id)) +
  geom_point(aes(shape = kpi_tier), size = 2.5, alpha = 0.72) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.1,
              aes(fill = company_id), alpha = 0.10) +
  scale_color_manual(values = comp_colors) +
  scale_fill_manual(values  = comp_colors) +
  scale_y_continuous(labels = comma) +
  scale_shape_manual(values = c("At Risk"=4,"Developing"=16,
                                 "Strong"=17,"Elite"=18)) +
  facet_wrap(~company_id, nrow = 2) +
  labs(
    title    = "Task 7 — Salary vs KPI Score with Regression Lines (Faceted)",
    subtitle = "Shaded area = 95% confidence interval | Shape = KPI tier",
    x        = "KPI Score",
    y        = "Monthly Salary",
    color    = "Company",
    shape    = "KPI Tier",
    caption  = "Source: Practicum Week-5 — KPI Dashboard"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#311B92"),
    plot.subtitle    = element_text(color = "#4527A0", size = 10),
    legend.position  = "bottom",
    panel.grid.minor = element_blank(),
    strip.text       = element_text(face = "bold", size = 10)
  )

9.5 Department Analysis — Avg Salary & KPI

# ============================================================
# TASK 7: Horizontal bar — department vs avg salary
# Secondary axis shows avg KPI score
# ============================================================

dept_summary <- dashboard_df %>%
  group_by(department) %>%
  summarise(
    Avg_Salary = mean(salary),
    Avg_KPI    = mean(KPI_score),
    Count      = n(),
    .groups    = "drop"
  ) %>%
  arrange(desc(Avg_Salary))

ggplot(dept_summary, aes(x = reorder(department, Avg_Salary))) +
  geom_col(aes(y = Avg_Salary, fill = department), alpha = 0.85, width = 0.55) +
  geom_point(aes(y = Avg_KPI * 100), color = "#1A237E", size = 5) +
  geom_line(aes(y = Avg_KPI * 100, group = 1),
            color = "#1A237E", linewidth = 1.1, linetype = "dashed") +
  scale_fill_brewer(palette = "Pastel1") +
  scale_y_continuous(
    name     = "Avg Monthly Salary",
    labels   = comma,
    sec.axis = sec_axis(~. / 100, name = "Avg KPI Score")
  ) +
  coord_flip() +
  labs(
    title   = "Task 7 — Department: Avg Salary & KPI Score",
    x       = "Department",
    fill    = "Department",
    caption = "Bars = Avg Salary | Blue points/line = Avg KPI (scaled ×100)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title      = element_text(face = "bold", size = 14, color = "#311B92"),
    legend.position = "none",
    panel.grid.minor = element_blank()
  )

9.6 Salary Distribution by KPI Tier (Area Chart)

# ============================================================
# TASK 7: Area Chart — salary density by KPI tier
# Overlapping filled density curves per tier
# ============================================================

ggplot(dashboard_df, aes(x = salary, fill = kpi_tier, color = kpi_tier)) +
  geom_density(alpha = 0.35, linewidth = 0.8) +
  scale_fill_manual(values  = tier_colors) +
  scale_color_manual(values = c("At Risk"    = "#C62828",
                                 "Developing" = "#E65100",
                                 "Strong"     = "#01579B",
                                 "Elite"      = "#6A1B9A")) +
  scale_x_continuous(labels = comma) +
  labs(
    title    = "Task 7 — Salary Density by KPI Tier",
    subtitle = "Area chart showing salary distribution overlap across all four KPI tiers",
    x        = "Monthly Salary",
    y        = "Density",
    fill     = "KPI Tier",
    color    = "KPI Tier",
    caption  = "Source: Practicum Week-5 — KPI Dashboard (n=250)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, color = "#311B92"),
    plot.subtitle    = element_text(color = "#4527A0", size = 10),
    legend.position  = "right",
    panel.grid.minor = element_blank()
  )

Interpretation

The KPI Dashboard synthesizes all prior tasks into one integrated pipeline. The grouped bar chart shows that KPI tier proportions are broadly similar across all 5 companies, as expected when every company is produced by the same data-generating function. The faceted scatter plots reveal a near-zero slope between KPI and salary in each company, a pattern that would be typical of organizations where pay is grade-based rather than performance-driven. The department analysis shows roughly equal salary and KPI levels across departments, while the area chart shows that salary distributions largely overlap across KPI tiers, which again points to a weak link between performance and compensation in this simulation.
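A "near-zero slope" read off a plot can be backed by numbers: fitting one linear model per company and inspecting the slope with its 95% confidence interval shows whether zero is a plausible value. The sketch below uses toy data in which salary is drawn independently of KPI; that independence is an assumption made for illustration and says nothing about how generate_company_data() actually works.

```r
set.seed(42)  # arbitrary seed for the toy data
toy <- data.frame(
  company_id = rep(c("C1", "C2"), each = 50),
  KPI_score  = runif(100, 60, 100),
  salary     = runif(100, 3000, 15000)   # drawn independently of KPI_score
)

# One regression per company: slope of salary on KPI plus its 95% CI
slope_tbl <- do.call(rbind, lapply(split(toy, toy$company_id), function(d) {
  fit <- lm(salary ~ KPI_score, data = d)
  ci  <- confint(fit)["KPI_score", ]
  data.frame(
    company_id = d$company_id[1],
    slope      = unname(coef(fit)["KPI_score"]),
    ci_low     = unname(ci[1]),
    ci_high    = unname(ci[2])
  )
}))

print(slope_tbl)
```

A confidence interval that contains 0 is consistent with no linear relationship, which is the numeric counterpart of a flat regression line in the faceted scatter plot.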

10. Task 8 — Automated Report Generation

Overview

Use functions + loops to automatically generate a structured company summary report for each company in the dashboard dataset. The report includes: per-company statistics, KPI tier breakdown, department breakdown, top performer listing, and a mini ggplot2 visualization. All content is generated programmatically from a single loop.

10.1 Automated Report Function

# ============================================================
# TASK 8 (BONUS): Automated Report Generation per Company
# generate_company_report(cid, df) → prints full summary
# Called inside a for loop to process all companies
# ============================================================

generate_company_report <- function(cid, df) {

  # Filter data for this specific company
  cdata <- df %>% filter(company_id == cid)

  # ---- Compute summary statistics ----
  avg_salary  <- round(mean(cdata$salary), 0)
  avg_kpi     <- round(mean(cdata$KPI_score), 2)
  avg_perf    <- round(mean(cdata$performance_score), 2)
  n_employees <- nrow(cdata)

  # Count top performers (KPI > 90)
  n_top       <- sum(cdata$top_performer == "Yes")
  pct_top     <- round(n_top / n_employees * 100, 1)

  # KPI tier breakdown (using table)
  tier_tbl    <- table(cdata$kpi_tier)

  # Department breakdown
  dept_tbl    <- cdata %>%
    group_by(department) %>%
    summarise(Count = n(), Avg_KPI = round(mean(KPI_score), 1), .groups = "drop") %>%
    arrange(desc(Avg_KPI))

  # List top 3 employees by KPI
  top3 <- cdata %>%
    arrange(desc(KPI_score)) %>%
    select(employee_id, department, KPI_score, salary) %>%
    head(3)

  # ---- Print formatted report ----
  cat(rep("=", 60), "\n", sep="")
  cat(sprintf("  AUTOMATED REPORT — COMPANY %s\n", cid))
  cat(rep("=", 60), "\n\n", sep="")

  cat(sprintf("  Total Employees    : %d\n", n_employees))
  cat(sprintf("  Avg Monthly Salary : IDR %s\n", formatC(avg_salary, big.mark=",", format="d")))
  cat(sprintf("  Avg KPI Score      : %.2f\n", avg_kpi))
  cat(sprintf("  Avg Perf Score     : %.2f\n", avg_perf))
  cat(sprintf("  Top Performers     : %d (%.1f%% of workforce)\n\n", n_top, pct_top))

  cat("  --- KPI TIER BREAKDOWN ---\n")
  for (tier in names(tier_tbl)) {
    bar_len <- round(tier_tbl[[tier]] / n_employees * 30)  # progress bar scale
    bar     <- paste0(rep("█", bar_len), collapse = "")
    cat(sprintf("  %-12s : %2d employees  %s\n", tier, tier_tbl[[tier]], bar))
  }

  cat("\n  --- DEPARTMENT SUMMARY ---\n")
  for (r in seq_len(nrow(dept_tbl))) {
    cat(sprintf("  %-12s : %2d emp | Avg KPI = %.1f\n",
                dept_tbl$department[r], dept_tbl$Count[r], dept_tbl$Avg_KPI[r]))
  }

  cat("\n  --- TOP 3 PERFORMERS ---\n")
  for (r in seq_len(nrow(top3))) {
    cat(sprintf("  #%d  %-12s | Dept: %-12s | KPI: %.1f | Salary: %s\n",
                r,
                top3$employee_id[r],
                top3$department[r],
                top3$KPI_score[r],
                formatC(top3$salary[r], big.mark=",", format="d")))
  }
  cat("\n")
}

10.2 Run Automated Report for All Companies

# ============================================================
# TASK 8 (BONUS): Loop over all companies and print reports
# ============================================================

# Get sorted list of company IDs
all_companies <- sort(unique(dashboard_df$company_id))

# MAIN LOOP: generate report for each company automatically
for (cid in all_companies) {
  generate_company_report(cid, dashboard_df)
}

============================================================
  AUTOMATED REPORT — COMPANY C1
============================================================

  Total Employees    : 50
  Avg Monthly Salary : IDR 8,617
  Avg KPI Score      : 82.73
  Avg Perf Score     : 72.66
  Top Performers     : 15 (30.0% of workforce)

  --- KPI TIER BREAKDOWN ---
  At Risk      :  0 employees  
  Developing   : 14 employees  ████████
  Strong       : 21 employees  █████████████
  Elite        : 15 employees  █████████

  --- DEPARTMENT SUMMARY ---
  HR           :  5 emp | Avg KPI = 94.1
  IT           :  8 emp | Avg KPI = 87.1
  Operations   : 10 emp | Avg KPI = 84.8
  Marketing    : 12 emp | Avg KPI = 81.2
  Finance      : 15 emp | Avg KPI = 76.5

  --- TOP 3 PERFORMERS ---
  #1  C1_E25       | Dept: HR           | KPI: 99.9 | Salary: 4,893
  #2  C1_E1        | Dept: HR           | KPI: 99.7 | Salary: 10,017
  #3  C1_E27       | Dept: IT           | KPI: 99.6 | Salary: 5,724

============================================================
  AUTOMATED REPORT — COMPANY C2
============================================================

  Total Employees    : 50
  Avg Monthly Salary : IDR 8,928
  Avg KPI Score      : 80.05
  Avg Perf Score     : 74.62
  Top Performers     : 14 (28.0% of workforce)

  --- KPI TIER BREAKDOWN ---
  At Risk      :  0 employees  
  Developing   : 20 employees  ████████████
  Strong       : 16 employees  ██████████
  Elite        : 14 employees  ████████

  --- DEPARTMENT SUMMARY ---
  IT           :  7 emp | Avg KPI = 82.6
  Finance      : 12 emp | Avg KPI = 81.7
  HR           : 12 emp | Avg KPI = 81.0
  Marketing    : 10 emp | Avg KPI = 77.7
  Operations   :  9 emp | Avg KPI = 77.3

  --- TOP 3 PERFORMERS ---
  #1  C2_E12       | Dept: Finance      | KPI: 99.4 | Salary: 10,431
  #2  C2_E27       | Dept: Marketing    | KPI: 98.6 | Salary: 10,790
  #3  C2_E9        | Dept: Marketing    | KPI: 97.8 | Salary: 13,123

============================================================
  AUTOMATED REPORT — COMPANY C3
============================================================

  Total Employees    : 50
  Avg Monthly Salary : IDR 9,304
  Avg KPI Score      : 79.57
  Avg Perf Score     : 77.65
  Top Performers     : 11 (22.0% of workforce)

  --- KPI TIER BREAKDOWN ---
  At Risk      :  0 employees  
  Developing   : 19 employees  ███████████
  Strong       : 20 employees  ████████████
  Elite        : 11 employees  ███████

  --- DEPARTMENT SUMMARY ---
  Operations   : 11 emp | Avg KPI = 85.4
  Finance      :  6 emp | Avg KPI = 81.2
  IT           : 11 emp | Avg KPI = 78.2
  HR           : 13 emp | Avg KPI = 77.4
  Marketing    :  9 emp | Avg KPI = 76.2

  --- TOP 3 PERFORMERS ---
  #1  C3_E7        | Dept: Operations   | KPI: 99.4 | Salary: 5,298
  #2  C3_E25       | Dept: Operations   | KPI: 97.6 | Salary: 12,027
  #3  C3_E31       | Dept: IT           | KPI: 97.6 | Salary: 13,232

============================================================
  AUTOMATED REPORT — COMPANY C4
============================================================

  Total Employees    : 50
  Avg Monthly Salary : IDR 8,564
  Avg KPI Score      : 79.70
  Avg Perf Score     : 76.71
  Top Performers     : 10 (20.0% of workforce)

  --- KPI TIER BREAKDOWN ---
  At Risk      :  0 employees  
  Developing   : 17 employees  ██████████
  Strong       : 23 employees  ██████████████
  Elite        : 10 employees  ██████

  --- DEPARTMENT SUMMARY ---
  HR           :  9 emp | Avg KPI = 80.6
  Operations   : 17 emp | Avg KPI = 80.2
  Finance      : 11 emp | Avg KPI = 79.8
  IT           :  8 emp | Avg KPI = 79.0
  Marketing    :  5 emp | Avg KPI = 77.2

  --- TOP 3 PERFORMERS ---
  #1  C4_E7        | Dept: Finance      | KPI: 99.6 | Salary: 4,168
  #2  C4_E31       | Dept: Finance      | KPI: 99.1 | Salary: 8,905
  #3  C4_E22       | Dept: IT           | KPI: 97.4 | Salary: 13,243

============================================================
  AUTOMATED REPORT — COMPANY C5
============================================================

  Total Employees    : 50
  Avg Monthly Salary : IDR 8,492
  Avg KPI Score      : 80.15
  Avg Perf Score     : 71.12
  Top Performers     : 11 (22.0% of workforce)

  --- KPI TIER BREAKDOWN ---
  At Risk      :  0 employees  
  Developing   : 14 employees  ████████
  Strong       : 25 employees  ███████████████
  Elite        : 11 employees  ███████

  --- DEPARTMENT SUMMARY ---
  HR           : 12 emp | Avg KPI = 83.2
  Marketing    : 13 emp | Avg KPI = 82.8
  Operations   :  7 emp | Avg KPI = 79.8
  IT           :  7 emp | Avg KPI = 77.4
  Finance      : 11 emp | Avg KPI = 75.7

  --- TOP 3 PERFORMERS ---
  #1  C5_E8        | Dept: HR           | KPI: 98.4 | Salary: 3,197
  #2  C5_E12       | Dept: Marketing    | KPI: 97.5 | Salary: 7,022
  #3  C5_E26       | Dept: Marketing    | KPI: 97.1 | Salary: 3,995
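Since the report function writes with cat(), its console output can also be redirected to one text file per company. The sketch below shows the idea with base R's capture.output(); print_report() is a hypothetical two-line stand-in for generate_company_report(), and the files are written to a temporary directory so nothing in the project folder is touched.

```r
# Hypothetical stand-in for generate_company_report(); prints two lines
print_report <- function(cid) {
  cat(sprintf("REPORT: COMPANY %s\n", cid))
  cat("(summary lines would follow here)\n")
}

out_dir <- tempdir()

# One text file per company, named after the company ID
for (cid in c("C1", "C2")) {
  path <- file.path(out_dir, paste0("report_", cid, ".txt"))
  capture.output(print_report(cid), file = path)
}

report_files <- list.files(out_dir, pattern = "^report_C[0-9]+\\.txt$")
print(report_files)
```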

10.3 Export Summary to CSV

# ============================================================
# TASK 8 (BONUS): Export company summary to CSV
# Demonstrates automated file output from a function
# ============================================================

# Build export data frame using a loop
export_rows <- list()

for (cid in all_companies) {
  cdata <- dashboard_df %>% filter(company_id == cid)

  export_rows[[cid]] <- data.frame(
    company_id       = cid,
    n_employees      = nrow(cdata),
    avg_salary       = round(mean(cdata$salary), 0),
    avg_kpi          = round(mean(cdata$KPI_score), 2),
    avg_performance  = round(mean(cdata$performance_score), 2),
    top_performers   = sum(cdata$top_performer == "Yes"),
    elite_count      = sum(cdata$kpi_tier == "Elite"),
    at_risk_count    = sum(cdata$kpi_tier == "At Risk")
  )
}

export_df <- bind_rows(export_rows)

# Write to CSV (will appear in knit working directory)
write.csv(export_df, "company_kpi_summary.csv", row.names = FALSE)

cat("CSV exported: company_kpi_summary.csv\n\n")

CSV exported: company_kpi_summary.csv

kable(export_df, caption = "Task 8 (Bonus) — Automated Export: Company KPI Summary") %>%
  kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)

Task 8 (Bonus) — Automated Export: Company KPI Summary
company_id	n_employees	avg_salary	avg_kpi	avg_performance	top_performers	elite_count	at_risk_count
C1	50	8617	82.73	72.66	15	15	0
C2	50	8928	80.05	74.62	14	14	0
C3	50	9304	79.57	77.65	11	11	0
C4	50	8564	79.70	76.71	10	10	0
C5	50	8492	80.15	71.12	11	11	0
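A quick way to confirm an export is to read the file back and compare it with the data frame that produced it. The sketch below does this round trip on a two-row toy frame (the two avg_kpi values are copied from the table above); it writes to a temporary file rather than company_kpi_summary.csv, and the object names are illustrative assumptions.

```r
# Two-row toy frame; avg_kpi values copied from the summary table above
export_toy <- data.frame(
  company_id = c("C1", "C2"),
  avg_kpi    = c(82.73, 80.05),
  stringsAsFactors = FALSE
)

# Write to a temporary file, then read it straight back
csv_path <- tempfile(fileext = ".csv")
write.csv(export_toy, csv_path, row.names = FALSE)
read_back <- read.csv(csv_path, stringsAsFactors = FALSE)

# Same columns, same rows, same values after the round trip
round_trip_ok <- identical(names(read_back), names(export_toy)) &&
  identical(read_back$company_id, export_toy$company_id) &&
  isTRUE(all.equal(read_back$avg_kpi, export_toy$avg_kpi))

print(round_trip_ok)
```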

10.4 Automated Mini-Plot per Company

# ============================================================
# TASK 8 (BONUS): Loop-generated plots — KPI distribution
# Uses a for loop to build one ggplot per company,
# then arranges them into a single dashboard figure with gridExtra
# ============================================================

# Build a list of plots (one per company)
plot_list <- list()

for (cid in all_companies) {
  cdata <- dashboard_df %>% filter(company_id == cid)

  p <- ggplot(cdata, aes(x = kpi_tier, fill = kpi_tier)) +
    geom_bar(alpha = 0.88, show.legend = FALSE) +
    geom_text(stat = "count", aes(label = after_stat(count)),
              vjust = -0.4, fontface = "bold", size = 3.8) +
    scale_fill_manual(values = tier_colors) +
    scale_y_continuous(limits = c(0, 28)) +
    labs(title = paste0("Company ", cid),
         x = NULL, y = "Employees") +
    theme_minimal(base_size = 10) +
    theme(
      plot.title         = element_text(face = "bold", color = "#311B92", size = 11),
      panel.grid.minor   = element_blank(),
      panel.grid.major.x = element_blank()
    )

  plot_list[[cid]] <- p
}

# Use gridExtra to arrange all 5 plots in one figure
library(gridExtra)
grid.arrange(grobs = plot_list, nrow = 2,
             top = "Task 8 (Bonus) — Automated KPI Tier Chart per Company")

Interpretation

Task 8 demonstrates a fully automated report generation pipeline: a single generate_company_report() function encapsulates all the summary logic, and a for loop runs it for every company without any manual repetition. The CSV export and the loop-generated grid of plots show how this pattern scales to any number of companies. In a real business context, this approach would allow analysts to refresh an entire portfolio-level HR report by re-running one loop, a core principle of reproducible data science workflows.

11. Summary and Conclusion

Comparison Table: All Tasks — Week 5

# ============================================================
# 11. Summary Table — all 8 tasks
# ============================================================

summary_df <- data.frame(
  Task = c("Task 1","Task 2","Task 3","Task 4",
           "Task 5","Task 6","Task 7","Task 8 ⭐"),
  Concept = c(
    "Dynamic Function + Nested Loop",
    "Nested Simulation + Discount Logic",
    "Categorization Function + Loop",
    "Multi-Company Data Generation",
    "Monte Carlo Pi Estimation",
    "Normalization & Feature Engineering",
    "KPI Dashboard — Mini Project",
    "Automated Report Generation (Bonus)"
  ),
  Key_Function = c(
    "compute_formula(x, formula)",
    "simulate_sales(n_sp, days) + get_discount()",
    "categorize_performance(sales_amount)",
    "generate_company_data(n_co, n_emp)",
    "monte_carlo_pi(n_points)",
    "normalize_columns(df) + z_score(df)",
    "Integrated pipeline across Tasks 4–6",
    "generate_company_report(cid, df)"
  ),
  Visualization = c(
    "Multi-line chart (log scale)",
    "Cumulative line chart per salesperson",
    "Bar chart + pie chart",
    "Scatter with regression + summary table",
    "Point plot with circle & sub-square overlay",
    "Faceted histogram + violin+boxplot",
    "Grouped bar + faceted scatter + area chart",
    "Automated text reports + grid of bar charts"
  ),
  check.names = FALSE
)

kable(summary_df,
      caption = "Practicum Week-5 — Complete Task Summary",
      col.names = c("Task","Concept","Key Function","Visualization")) %>%
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE, font_size = 13)

Practicum Week-5 — Complete Task Summary
Task	Concept	Key Function	Visualization
Task 1	Dynamic Function + Nested Loop	compute_formula(x, formula)	Multi-line chart (log scale)
Task 2	Nested Simulation + Discount Logic	simulate_sales(n_sp, days) + get_discount()	Cumulative line chart per salesperson
Task 3	Categorization Function + Loop	categorize_performance(sales_amount)	Bar chart + pie chart
Task 4	Multi-Company Data Generation	generate_company_data(n_co, n_emp)	Scatter with regression + summary table
Task 5	Monte Carlo Pi Estimation	monte_carlo_pi(n_points)	Point plot with circle & sub-square overlay
Task 6	Normalization & Feature Engineering	normalize_columns(df) + z_score(df)	Faceted histogram + violin+boxplot
Task 7	KPI Dashboard — Mini Project	Integrated pipeline across Tasks 4–6	Grouped bar + faceted scatter + area chart
Task 8 ⭐	Automated Report Generation (Bonus)	generate_company_report(cid, df)	Automated text reports + grid of bar charts

Key Conclusions

Practicum Week-5 has successfully demonstrated the integration of functions, loops, simulation, and ggplot2 visualization in R for advanced data science workflows. The following key points were established:

Function validation (Task 1) makes pipelines more robust: invalid inputs are caught before they produce silent errors downstream, a core data engineering practice.
Nested helper functions (Task 2) improve code organization by separating business logic (get_discount) from simulation orchestration (simulate_sales), an application of the single responsibility principle.
5-tier categorization (Task 3) demonstrates how wrapping conditional logic in a function and calling it from a loop avoids repeating if-else chains, a pattern applicable to any rule-based classification problem.
Nested loops (Tasks 4 & 7) are a natural structure for generating multi-level datasets (outer loop = entity, inner loop = record), mirroring the hierarchical structure of many database ETL pipelines.
Monte Carlo simulation (Task 5) illustrates that estimates from large-scale random sampling converge toward the true value (here, π), the foundation of probabilistic modeling, Bayesian inference, and uncertainty quantification.
Loop-based normalization (Task 6) shows that column transformations can be automated systematically across any number of numeric columns, supporting reproducibility and scalability.
The KPI Dashboard (Task 7) integrates all skills into one analytics pipeline (generation → loop classification → multi-chart visualization), demonstrating an end-to-end data science workflow in R.
Automated reporting (Task 8 Bonus) extends the pipeline further: one function generates a complete company report, and a single loop produces the full portfolio analysis with CSV export and automated plots, a defining characteristic of reproducible analytics.

Together, these eight tasks show that reusable functions, disciplined looping, expressive visualization, and automated reporting form the core toolkit for data science work in R.
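The convergence point made for the Monte Carlo task can be illustrated in a few lines. The sketch below is not the practicum's monte_carlo_pi() function, whose body is not shown in this section; estimate_pi() is a hypothetical minimal version that samples points uniformly in the square [-1, 1] × [-1, 1] and uses the fraction that falls inside the unit circle.

```r
# Hypothetical minimal estimator (not the practicum's monte_carlo_pi)
estimate_pi <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  # Area ratio circle / square = pi / 4, so scale the hit rate by 4
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(123)  # arbitrary seed so the run is repeatable
sizes     <- c(100, 10000, 1000000)
estimates <- sapply(sizes, estimate_pi)

print(data.frame(
  n_points  = sizes,
  estimate  = estimates,
  abs_error = abs(estimates - pi)
))
```

The standard error of this estimator shrinks in proportion to 1/sqrt(n), so each hundredfold increase in points is expected to cut the typical error by roughly a factor of ten, although any single run can deviate from that trend.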

Practicum Week-5: FUNCTIONS & LOOPS + DATA SCIENCE

FRENKHY TONGA RETANG

2026-04-01
