Data Science Programming · Advanced Practicum

Advanced Practicum: Functions & Loops

Data Science Program Study — Functions, Loops, Simulation & Data Science Workflow

1 Dynamic Multi-Formula Function

Task: Build `compute_formula(x, formula)` supporting linear, quadratic, cubic, and exponential formulas. Plot all on the same graph for x = 1:20.

Nested loops to compute multiple formulas at once
Validate formula input with informative error messages
Plot all 4 formulas on the same graph

1.1 Function

# Function: compute_formula 
# Define a function to calculate values based on a specific mathematical formula
compute_formula <- function(x, formula) {
  # Define a list of accepted formula names
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  
  # Check if the provided formula is in the list of valid options; if not, stop and show an error
  if (!formula %in% valid_formulas) {
    stop(paste0("Invalid formula '", formula,
                "'. Choose from: ", paste(valid_formulas, collapse = ", ")))}
  
  # Determine the calculation logic based on the formula type selected
  y <- if (formula == "linear")      { 2 * x + 1          }
  else if (formula == "quadratic")   { x^2 - 3 * x + 2    }
  else if (formula == "cubic")       { x^3 - 5 * x^2 + 4  }
  else                               { exp(0.3 * x)        }
  
  # Return the calculated result
  return(y)}

# Define a sequence of x values from 1 to 20
x_vals   <- 1:20

# Define the set of formulas to be processed
formulas <- c("linear", "quadratic", "cubic", "exponential")

# Initialize an empty data frame to store the results
results  <- data.frame()

# Iterate through each formula type
for (f in formulas) {
  # Iterate through each value in the x sequence
  for (x in x_vals) {
    # Call the function to calculate y for the current x and formula
    y <- compute_formula(x, f)
    
    # Append the results (Formula name, x value, and rounded y value) to the data frame
    results <- rbind(results, data.frame(Formula = f, x = x, y = round(y, 4)))}}

Result · All Formula Values (x = 1:20)

Interpretation: The table above presents the computed y-values for four mathematical formulas (linear, quadratic, cubic, and exponential) across x values from 1 to 20. The linear formula (y = 2x + 1) increases by a constant 2 per step, from 3 to 41. The quadratic formula (y = x² − 3x + 2) is a parabola with its vertex at x = 1.5, so on this grid it equals 0 at x = 1 and x = 2 and then accelerates to 342 at x = 20. The cubic formula (y = x³ − 5x² + 4) dips below zero for x = 2 to 4 and then grows very rapidly, reaching 6,004 at x = 20. The exponential formula (y = e^(0.3x)) starts near 1.35 and compounds by about 35% per step, reaching about 403 at x = 20. Comparing the y columns across formulas highlights how different mathematical structures behave at the same x input.

1.2 Visualization

Visualization · Multi-Formula Plot

Interpretation: The multi-formula line plot shows the diverging growth rates of the four formulas as x increases from 1 to 20. The linear line (blue) rises at a constant slope of 2; it is the highest of the four at x = 1 but the lowest from x = 11 onward, ending at y = 41. The quadratic line (gold) curves upward and overtakes the linear line at x = 5. The cubic line (orange) dips below zero for x = 2 to 4, passes both the linear and quadratic lines at x = 6, and dominates from there on, reaching 6,004 at x = 20; on the shared y-axis its early values look flat. The exponential line (green) overtakes the linear line at x = 11 and the quadratic line only at x = 20, and it stays far below the cubic for x ≥ 6. This visualization underscores the importance of formula selection in modeling: different function types produce vastly different outputs at scale.
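The crossover points described above can be checked numerically. The following is a minimal sketch in base R; the formulas are restated from `compute_formula()` so the block runs on its own, and `overtakes_at()` is a hypothetical helper introduced only for this illustration.

```r
# Formulas restated from compute_formula() so this block runs on its own
x <- 1:20
vals <- data.frame(
  x           = x,
  linear      = 2 * x + 1,
  quadratic   = x^2 - 3 * x + 2,
  cubic       = x^3 - 5 * x^2 + 4,
  exponential = exp(0.3 * x)
)

# Hypothetical helper: smallest x from which series `a` stays above series `b`
# through x = 20 (NA if `a` is not above `b` at x = 20)
overtakes_at <- function(a, b) {
  behind <- which(vals[[a]] <= vals[[b]])
  if (length(behind) == 0) return(vals$x[1])
  if (max(behind) == nrow(vals)) return(NA_integer_)
  vals$x[max(behind) + 1]
}

overtakes_at("quadratic", "linear")       # 5
overtakes_at("cubic", "quadratic")        # 6
overtakes_at("exponential", "linear")     # 11
overtakes_at("exponential", "quadratic")  # 20

# Values at the right edge of the plot
round(vals[vals$x == 20, ], 2)
```

The helper looks for the last x at which one series is still at or below the other, because the exponential is briefly above the quadratic at x = 1 to 3 before falling behind again.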

2 Nested Simulation: Multi-Sales & Discounts

Task: Build `simulate_sales(n_salesperson, days)` generating sales data with conditional discounts and cumulative sales per salesperson.

Nested function to calculate cumulative sales per salesperson
Apply conditional discounts based on sales amount
Summary statistics and cumulative sales plot

2.1 Dataset & Function

# Set the seed for reproducibility to ensure the random numbers are the same each time
set.seed(42)

# Define a function to calculate the running total (cumulative sum) of a vector
cumulative_sales <- function(sales_vector) {
  # Initialize an empty vector for cumulative values and a counter for the running total
  cum <- c(); total <- 0
  # Loop through each sales value in the input vector, add it to the total, and store it
  for (s in sales_vector) { total <- total + s; cum <- c(cum, total) }
  # Return the vector of cumulative sums
  return(cum)}

# Define a function to simulate sales data for multiple people over several days
simulate_sales <- function(n_salesperson, days) {
  # Initialize an empty data frame to store the final simulation results
  result <- data.frame()
  # Loop through the number of salespersons specified
  for (sp in 1:n_salesperson) {
    # Generate random daily sales amounts between 500 and 5000 and round them
    daily_sales <- round(runif(days, min = 500, max = 5000))
    # Calculate the cumulative sales for the current salesperson
    cum_sales   <- cumulative_sales(daily_sales)
    # Loop through each day to calculate discounts and store individual records
    for (d in 1:days) {
      # Get the sales amount for the current day
      amt <- daily_sales[d]
      # Determine the discount rate based on the amount: 20% for >4000, 10% for >2500, else 5%
      discount_rate <- if (amt > 4000) { 0.20 } else if (amt > 2500) { 0.10 } else { 0.05 }
      # Combine the data into a new row and append it to the main result data frame
      result <- rbind(result, data.frame(
        sales_id      = paste0("SP", sp, "_D", d),
        salesperson   = paste("Salesperson", sp),
        day           = d,
        sales_amount  = amt,
        discount_rate = discount_rate,
        net_sales     = round(amt * (1 - discount_rate)),
        cum_sales     = cum_sales[d]
      ))}}
  # Return the complete data frame containing all simulated records
  return(result)}

# Execute the simulation for 5 salespersons over 10 days and store the result
sales_data <- simulate_sales(n_salesperson = 5, days = 10)

Result · Sales Simulation Data

Interpretation: The sales simulation table records 50 transactions (5 salespersons × 10 days). Each row captures the daily sales amount, the applied discount rate, the net sales after discount, and the running cumulative sales total. Discount rates are applied conditionally: 20% for sales above 4,000, 10% for sales from 2,501 to 4,000, and 5% for sales at or below 2,500. Because larger sales receive larger discounts, net sales grow more slowly than gross sales and drop at each tier boundary. The cumulative sales column is computed from the gross sales_amount, not from net_sales, and tracks each salesperson’s running gross total over the 10-day period.
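The tier boundaries are easiest to see on a few hand-picked amounts. This is a small sketch, assuming the same thresholds as `simulate_sales()`; `discount_for()` is a hypothetical vectorized helper, not part of the original code, and it uses nested `ifelse()` in place of the scalar `if`/`else` chain.

```r
# Hypothetical vectorized helper; thresholds copied from simulate_sales()
discount_for <- function(amt) {
  ifelse(amt > 4000, 0.20, ifelse(amt > 2500, 0.10, 0.05))
}

# Amounts chosen to sit on and just above each tier boundary
amounts <- c(500, 2500, 2501, 4000, 4001, 5000)
tiers <- data.frame(
  sales_amount  = amounts,
  discount_rate = discount_for(amounts)
)
tiers$net_sales <- round(tiers$sales_amount * (1 - tiers$discount_rate))
tiers
```

In this sketch a sale of 4,001 nets 3,201 while a sale of 4,000 nets 3,600, which illustrates how a step discount can lower net sales just above a threshold.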

2.2 Summary

Result · Summary Statistics per Salesperson

Interpretation: The summary statistics table aggregates each salesperson’s 10-day performance. Key metrics include Total Sales (gross revenue before discounts), Avg Daily Sales, the Max and Min single-day sales, and the Max Cumulative Sales reached by day 10. Because every daily sale is positive, the cumulative total only increases, so Max Cumulative Sales equals Total Sales, and Total Sales is simply 10 × Avg Daily Sales. All salespersons draw from the same uniform distribution, so differences in totals and averages reflect random variability in the simulation rather than any built-in difference between salespersons. This table provides a quick benchmark for comparing individual salesperson productivity.
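The code behind the summary table is not shown, so the following is only a sketch of how such a table could be built in base R. The toy data, the column names (`Total_Sales`, `Avg_Daily_Sales`, and so on), and the use of `ave()` with `cumsum` are all assumptions made for illustration.

```r
# Toy data: 3 salespersons x 4 days (assumed sizes, smaller than the report's 5 x 10)
set.seed(42)
toy_sales <- data.frame(
  salesperson  = rep(paste("Salesperson", 1:3), each = 4),
  day          = rep(1:4, times = 3),
  sales_amount = round(runif(12, min = 500, max = 5000))
)

# Running total within each salesperson
toy_sales$cum_sales <- ave(toy_sales$sales_amount, toy_sales$salesperson,
                           FUN = cumsum)

# One summary row per salesperson, then stack the rows
per_person <- lapply(split(toy_sales, toy_sales$salesperson), function(d) {
  data.frame(
    salesperson     = d$salesperson[1],
    Total_Sales     = sum(d$sales_amount),
    Avg_Daily_Sales = mean(d$sales_amount),
    Max_Sale        = max(d$sales_amount),
    Min_Sale        = min(d$sales_amount),
    Max_Cum_Sales   = max(d$cum_sales)
  )
})
summary_tbl <- do.call(rbind, per_person)
rownames(summary_tbl) <- NULL
summary_tbl
```

In the output, `Max_Cum_Sales` matches `Total_Sales` in every row, since a running sum of positive amounts peaks on the last day.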

2.3 Visualization

Visualization · Cumulative Sales Plot

Interpretation: The cumulative sales line chart tracks each salesperson’s running total from Day 1 to Day 10. Day 1 values already differ, since each is a single draw between 500 and 5,000, and the trajectories diverge further as the random daily sales accumulate. Steeper segments mark days with higher sales. By Day 10, the gap between the highest and lowest cumulative totals shows the overall range of performance. Every daily sale is at least 500, so no line can be completely flat; a weak stretch appears as a shallower slope and a strong stretch as a steeper one. This chart is useful for identifying top performers over time.

3 Multi-Level Performance Categorization

Task: Build `categorize_performance(sales_amount)` with 5 categories. Calculate percentages per category and visualise with bar plot and pie chart.

5 categories: Excellent, Very Good, Good, Average, Poor
Loop through vector to assign category
Bar plot and pie chart of distribution

3.1 Function

# Define a function to assign a performance label based on sales amount
categorize_performance <- function(sales_amount) {
  # Initialize an empty vector to store the results
  categories <- c()
  # Loop through each sales figure provided in the input vector
  for (s in sales_amount) {
    # Determine the category based on specific threshold values
    # (stored in `label` rather than `cat` to avoid masking base::cat)
    label <- if (s >= 4500)      "Excellent"
             else if (s >= 3500) "Very Good"
             else if (s >= 2500) "Good"
             else if (s >= 1500) "Average"
             else                "Poor"
    # Append the determined category to the results vector
    categories <- c(categories, label)}
  # Return the complete vector of performance categories
  return(categories)}

# Extract the sales_amount column from the existing sales_data data frame
all_sales <- sales_data$sales_amount
# Apply the categorization function to all sales figures
perf_cats <- categorize_performance(all_sales)
# Create a new data frame pairing the original sales amounts with their new categories
perf_df   <- data.frame(sales_amount = all_sales, category = perf_cats)

# Load dplyr, which provides %>%, count(), mutate(), and arrange()
library(dplyr)

# Summarize the performance data using the pipe operator
perf_summary <- perf_df %>%
  # Count the number of occurrences for each category
  count(category) %>%
  # Add a percentage column and convert category to a factor with a specific order
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    category   = factor(category, levels = c("Excellent","Very Good","Good","Average","Poor"))) %>%
  # Sort the final summary table based on the category ranking
  arrange(category)

Result · Performance Category Distribution

Interpretation: This table summarizes how the 50 sales transactions (5 salespersons × 10 days) are distributed across 5 performance tiers. Categories are assigned based on thresholds: Excellent (≥4,500), Very Good (≥3,500), Good (≥2,500), Average (≥1,500), and Poor (<1,500). Since sales values were generated uniformly between 500 and 5,000, each of the four 1,000-wide tiers (Poor through Very Good) is expected to hold about 22% of transactions, and the 500-wide Excellent tier about 11%. The percentage column provides a normalized view of the proportion in each category, making it easier to compare performance levels without being influenced by the total count.
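The expected share of each tier follows directly from the width of its interval. The sketch below computes those shares with `punif()`; it treats sales as continuous and ignores the small effect of rounding to whole numbers, which is an approximation.

```r
# Tier boundaries from categorize_performance(), bounded by the runif() range
breaks <- c(500, 1500, 2500, 3500, 4500, 5000)
tiers  <- c("Poor", "Average", "Good", "Very Good", "Excellent")

# Probability mass of Uniform(500, 5000) falling in each tier
expected_share <- diff(punif(breaks, min = 500, max = 5000))
names(expected_share) <- tiers

round(100 * expected_share, 1)  # expected percentage per tier
round(50 * expected_share, 1)   # expected count out of 50 transactions
```

With only 50 transactions, observed counts can differ from these expectations by several units in either direction.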

3.2 Visualization

Visualization · Performance Bar Chart

Interpretation: The bar chart visualizes the count of transactions in each performance category. The height of each bar reflects how often that performance tier was achieved. Under a uniform distribution, the Poor, Average, Good, and Very Good bars are expected to be roughly equal (about 11 of 50 transactions each), so differences among them reflect sampling noise in a small sample. The Excellent bar is expected to be about half as tall because its tier (4,500 to 5,000) is half as wide as the others, not because high sales are intrinsically rarer. The labeled counts and percentages above each bar allow for quick comparison without needing to refer back to the table. Color-coding by tier (green = Excellent, red = Poor) further reinforces the performance ranking.

Visualization · Performance Pie Chart

Interpretation: The donut chart provides a proportional view of how the 50 simulated transactions are distributed across the five performance tiers. Each segment’s size is proportional to the count (and percentage) of sales in that category. The center hole distinguishes it from a standard pie chart and reduces visual clutter while still conveying the full distribution. Four segments of similar size (Poor through Very Good) and an Excellent segment about half that size would be consistent with the uniform sales data. Larger deviations from that pattern reflect the small sample of 50 transactions rather than a tendency toward moderate outcomes, since a uniform distribution makes every sales value in the range equally likely.

4 Multi-Company Dataset Simulation

Task: Build `generate_company_data(n_company, n_employees)` generating employee records with performance and KPI scores. Summarise per company and identify top performers.

Nested loops per company and employee
Conditional logic: top performers where KPI > 90
Summary per company: avg salary, avg performance, max KPI

4.1 Dataset & Function

# Set the seed for reproducibility to ensure the random number generation is consistent
set.seed(2024)

# Define a function to generate simulated employee data for a specific number of companies and employees
generate_company_data <- function(n_company, n_employees) {
  # Create a vector of possible department names
  departments <- c("Finance","HR","IT","Marketing","Operations","Sales")
  # Initialize an empty data frame to collect all generated records
  all_data    <- data.frame()
  # Outer loop to iterate through each company
  for (c in 1:n_company) {
    # Inner loop to iterate through each employee within the current company
    for (e in 1:n_employees) {
      # Generate a random performance score between 50 and 100, rounded to 1 decimal place
      perf_score    <- round(runif(1, 50, 100), 1)
      # Generate a random KPI score between 55 and 100, rounded to 1 decimal place
      kpi_score     <- round(runif(1, 55, 100), 1)
      # Flag as "Yes" if the KPI score is above 90, otherwise "No"
      top_performer <- ifelse(kpi_score > 90, "Yes", "No")
      # Assign a salary range based on the performance score threshold
      base_salary   <- if (perf_score >= 85)      { round(runif(1, 9000, 15000)) }
                       else if (perf_score >= 70) { round(runif(1, 6000, 9000))  }
                       else                       { round(runif(1, 3000, 6000))  }
      # Create a data frame for the current employee and append it to the master data frame
      all_data <- rbind(all_data, data.frame(
        company_id        = paste0("C", c),
        employee_id       = paste0("C", c, "_E", e),
        department        = sample(departments, 1),
        salary            = base_salary,
        performance_score = perf_score,
        KPI_score         = kpi_score,
        top_performer     = top_performer
      ))}}
  # Return the full data frame containing all companies and employees
  return(all_data)}

# Call the function to generate data for 4 companies with 15 employees each
company_data    <- generate_company_data(n_company = 4, n_employees = 15)

# Summarize the generated data using the pipe operator
company_summary <- company_data %>%
  # Group the data by company ID to perform calculations per company
  group_by(company_id) %>%
  # Calculate various averages and totals for each group
  summarise(
    Avg_Salary      = round(mean(salary), 0),
    Avg_Performance = round(mean(performance_score), 2),
    Avg_KPI         = round(mean(KPI_score), 2),
    Max_KPI         = max(KPI_score),
    Top_Performers  = sum(top_performer == "Yes"),
    Total_Employees = n(),
    # Drop the grouping structure to return a standard data frame
    .groups = "drop")

Result · Employee Dataset

Interpretation: The employee dataset table contains 60 records (4 companies × 15 employees each), with columns for company ID, employee ID, department, salary, performance score, KPI score, and top performer status. Salaries are conditionally assigned based on performance thresholds — higher-performing employees (score ≥ 85) earn between 9,000–15,000, mid-range performers (≥ 70) earn 6,000–9,000, and lower performers receive 3,000–6,000. Employees flagged as “Yes” in the top_performer column have a KPI score above 90, making them the elite subset of the workforce. This table serves as the raw data source for all subsequent company-level aggregations.

Result · Company Summary

Interpretation: The company summary table consolidates key HR metrics across the four simulated companies. Avg_Salary reflects the mean compensation level, which is influenced by the mix of high, mid, and low performers per company. Avg_Performance and Avg_KPI indicate the general quality of the workforce; a higher average KPI suggests a more productive company. Max_KPI highlights the peak individual performer, while Top_Performers (employees with KPI > 90) shows how many exceptional employees each company has. Companies with more top performers and a higher average KPI would generally indicate a stronger talent pool, but in this simulation all companies are drawn from the same distributions, so the differences between them are random. The fixed seed (set.seed(2024)) makes the table reproducible; a different seed would produce a different ranking.

4.2 Visualization

Visualization · Average Salary per Company

Interpretation: The average salary bar chart compares the mean compensation across the four companies (C1–C4). Differences in bar heights reflect variations in each company’s workforce composition: a company with more high-performing employees (performance score ≥ 85) will have a higher average salary because of the tiered salary assignment logic. A noticeably taller bar therefore suggests a greater proportion of top-tier performers and should align with a higher Avg_Performance in the summary table. It need not align with Avg_KPI, because KPI scores are generated independently of performance scores and play no part in the salary assignment. This chart is useful for benchmarking compensation levels across companies.
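As a reference point for the bar heights, the expected average salary can be worked out from the salary logic in `generate_company_data()`. This is a sketch that treats scores and salaries as continuous and ignores rounding; the variable names are introduced here for illustration.

```r
# Probability of each performance band under Uniform(50, 100):
# below 70, from 70 to 85, and 85 or above
band_prob <- diff(punif(c(50, 70, 85, 100), min = 50, max = 100))

# Mean of each band's uniform salary range
band_mean_salary <- c((3000 + 6000) / 2, (6000 + 9000) / 2, (9000 + 15000) / 2)

# Expected salary of a randomly generated employee
expected_avg_salary <- sum(band_prob * band_mean_salary)

band_prob            # 0.4 0.3 0.3
expected_avg_salary  # 7650
```

Company averages based on 15 employees will scatter around this value, so bars somewhat above or below 7,650 are expected even though every company uses the same generating process.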

Visualization · Top Performers per Company

Interpretation: The top performers bar chart counts the number of employees with a KPI score exceeding 90 in each company. Since KPI scores were generated uniformly between 55 and 100, roughly 22% of employees are expected to cross the 90 threshold (the top 10 out of 45 points of the range), or about 3 of the 15 employees per company. In this simulation, differences in the count between companies arise by chance; in a real-world context, a higher count could indicate a more competitive or incentivized work environment. Comparing this chart with the average salary chart would, in real data, reveal whether higher-performing companies also invest more in compensation. Here no systematic link is expected, because salary depends only on the performance score, which is generated independently of KPI.
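How much the top-performer count can vary by chance follows from a binomial model. The sketch below assumes each of the 15 employees independently has probability 10/45 of exceeding 90, ignoring the small effect of rounding KPI scores to one decimal place.

```r
# Probability that a Uniform(55, 100) KPI score exceeds 90
p_top <- 1 - punif(90, min = 55, max = 100)   # 10/45
n_emp <- 15

expected_top <- n_emp * p_top                      # about 3.33
sd_top       <- sqrt(n_emp * p_top * (1 - p_top))  # about 1.61

# Probability of observing exactly k top performers, k = 0..15
probs <- dbinom(0:n_emp, size = n_emp, prob = p_top)
names(probs) <- 0:n_emp
round(probs[1:8], 3)
```

A standard deviation of about 1.6 around a mean of about 3.3 means that counts anywhere from 1 to 6 are unremarkable for a single company.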

5 Monte Carlo Simulation: Pi & Probability

Task: Build `monte_carlo_pi(n_points)` to estimate π. Additionally compute the probability of random points falling inside a sub-square. Visualise points inside vs outside the circle.

Loop for iterations to count points in circle
Compute π estimate from ratio of inside points
Compute probability of falling in a sub-square (0–0.5 × 0–0.5)

5.1 Function

# Set the seed for reproducibility to ensure the random points generated are the same every time
set.seed(99)

# Define a function to estimate the value of Pi using the Monte Carlo method
monte_carlo_pi <- function(n_points) {
  # Generate random x-coordinates between -1 and 1
  x <- runif(n_points, -1, 1)
  # Generate random y-coordinates between -1 and 1
  y <- runif(n_points, -1, 1)
  # Initialize a counter for points that fall inside the unit circle
  inside_circle    <- 0
  # Initialize a counter for points that fall within a specific small square (0 to 0.5 on both axes)
  inside_subsquare <- 0
  
  # Loop through every generated point to check its location
  for (i in 1:n_points) {
    # Calculate the distance of the point from the origin (0,0)
    dist <- sqrt(x[i]^2 + y[i]^2)
    # If the distance is 1 or less, the point is inside the circle; increment the counter
    if (dist <= 1) { inside_circle <- inside_circle + 1 }
    # Check if the point falls within the bounds of the smaller square (0, 0) to (0.5, 0.5)
    if (x[i] >= 0 && x[i] <= 0.5 && y[i] >= 0 && y[i] <= 0.5) {
      inside_subsquare <- inside_subsquare + 1}}
  
  # Estimate Pi: (Points in Circle / Total Points) * Area of the bounding square (4)
  pi_estimate <- 4 * inside_circle / n_points
  # Calculate the empirical probability of a point falling in the small square
  prob_subsq  <- inside_subsquare / n_points
  
  # Return a list containing the Pi estimate, counts, probability, and the raw coordinates/status
  list(pi_estimate = pi_estimate, inside_circle = inside_circle,
       n_points = n_points, prob_subsquare = round(prob_subsq, 4),
       x = x, y = y, in_circle = sqrt(x^2 + y^2) <= 1)}

# Execute the Monte Carlo simulation with 10,000 points and store the results in 'mc'
mc <- monte_carlo_pi(10000)

Result · Monte Carlo π Estimation

Interpretation: This table summarizes the results of the Monte Carlo π estimation using 10,000 randomly placed points. The Estimated π is calculated as 4 × (points inside circle / total points), based on the ratio of the unit circle’s area (π) to the area of its bounding square (4). The Absolute Error is the difference between the estimate and the true value of π (≈ 3.141593). With 10,000 points the standard error of the estimate is about 0.016, so an absolute error of a few hundredths is expected; the error shrinks in proportion to 1/√n as more points are used. The P(Sub-square) value (~0.0625) is the empirical probability that a random point falls within the [0, 0.5] × [0, 0.5] sub-square, which has a theoretical probability of 0.0625 (area = 0.25 out of the total square area of 4). Agreement between the two is consistent with the points being uniformly distributed over the square.
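The relationship between sample size and accuracy can be sketched as follows. `estimate_pi()` is a hypothetical vectorized variant of the loop-based `monte_carlo_pi()`, and the standard error formula assumes each point is an independent Bernoulli trial with success probability π/4.

```r
set.seed(99)

# Hypothetical vectorized estimator: share of points inside the unit circle, times 4
estimate_pi <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  4 * mean(x^2 + y^2 <= 1)
}

# Theoretical standard error of the estimate for n points
p_inside <- pi / 4
se_pi <- function(n) 4 * sqrt(p_inside * (1 - p_inside) / n)

n_grid <- c(100, 10000, 1000000)
data.frame(
  n_points  = n_grid,
  estimate  = sapply(n_grid, estimate_pi),
  std_error = se_pi(n_grid)
)
```

Each hundredfold increase in points reduces the standard error by a factor of ten, which is why gains in accuracy become expensive at large n.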

5.2 Visualization

Visualization · Monte Carlo π Scatter

Interpretation: The Monte Carlo scatter plot displays 3,000 randomly sampled points (from the 10,000 total) plotted on a 2D plane bounded by [-1, 1] on both axes. Blue points fall inside the unit circle (distance from origin ≤ 1), while orange points fall outside. The solid black circle is the unit circle boundary, and the dashed gold square marks the [0, 0.5] × [0, 0.5] sub-square region. The ratio of blue to total points approximates π/4, so the plot is a visual illustration (not a proof) of the Monte Carlo method. The accuracy of the π estimate depends on the number of points, not on how dense the blue region looks: more points give a more stable ratio. The sub-square overlay also shows how the empirical probability of ~6.25% arises from the small region it occupies within the full bounding square.

6 Advanced Data Transformation & Feature Engineering

Task: Build `normalize_columns(df)` and `z_score(df)` plus create new engineered features. Compare distributions before and after transformation.

Loop-based min-max normalization and z-score standardisation
Create new features: performance_category and salary_bracket
Histograms and boxplots before and after transformation

6.1 Function

# Define a function to scale numeric columns to a range between 0 and 1 (Min-Max Normalization)
normalize_columns <- function(df) {
  # Identify the names of all columns in the data frame that are numeric
  num_cols <- names(df)[sapply(df, is.numeric)]
  # Create a copy of the original data frame to store the normalized values
  norm_df  <- df
  # Loop through each identified numeric column
  for (col in num_cols) {
    # Find the minimum value of the current column, ignoring missing values
    min_val        <- min(df[[col]], na.rm = TRUE)
    # Find the maximum value of the current column, ignoring missing values
    max_val        <- max(df[[col]], na.rm = TRUE)
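    # Apply the normalization formula: (x - min) / (max - min);
    # a constant column (max == min) is mapped to 0 instead of NaN
    range_val      <- max_val - min_val
    norm_df[[col]] <- (df[[col]] - min_val) / (if (isTRUE(range_val == 0)) 1 else range_val)}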
  # Return the data frame with normalized numeric columns
  return(norm_df)}

# Define a function to standardize numeric columns using Z-score (mean = 0, std dev = 1)
z_score <- function(df) {
  # Identify the names of all numeric columns
  num_cols <- names(df)[sapply(df, is.numeric)]
  # Create a copy of the original data frame to store the standardized values
  z_df     <- df
  # Loop through each numeric column
  for (col in num_cols) {
    # Calculate the average (mean) of the current column
    mu         <- mean(df[[col]], na.rm = TRUE)
    # Calculate the standard deviation of the current column
    sigma      <- sd(df[[col]], na.rm = TRUE)
    # Apply the Z-score formula: (x - mean) / standard deviation;
    # a constant column (sd == 0) is mapped to 0 instead of NaN
    z_df[[col]] <- (df[[col]] - mu) / (if (isTRUE(sigma == 0)) 1 else sigma)}
  # Return the data frame with standardized values
  return(z_df)}

# Select specific numeric columns from the existing company_data for processing
numeric_df <- company_data %>% select(salary, performance_score, KPI_score)
# Create a normalized version of the selected columns
norm_df    <- normalize_columns(numeric_df)
# Create a standardized (Z-score) version of the selected columns
z_df       <- z_score(numeric_df)

# Create a new feature-engineered data frame based on company_data
feat_df <- company_data %>%
  # Use mutate to add new categorical columns based on numeric thresholds
  mutate(
    # Categorize performance scores into four descriptive levels
    performance_category = case_when(
      performance_score >= 90 ~ "Exceptional",
      performance_score >= 75 ~ "Strong",
      performance_score >= 60 ~ "Moderate",
      TRUE                    ~ "Needs Improvement"),
    # Categorize salary amounts into four professional brackets
    salary_bracket = case_when(
      salary >= 12000 ~ "Senior",
      salary >= 7500  ~ "Mid-level",
      salary >= 4500  ~ "Junior",
      TRUE            ~ "Entry"))

Result · Normalized vs Z-Score

Interpretation: This table compares the first 10 rows of salary and performance score values across three forms: Original (raw values), Min-Max Normalized (scaled to 0–1), and Z-Score Standardized (mean = 0, std = 1). The normalized salary values compress the wide salary range (e.g., 3,000–15,000) into a 0–1 interval, making all features comparable in magnitude — essential for distance-based machine learning algorithms. The Z-score column transforms values so that 0 represents the mean salary, positive values indicate above-average, and negative values indicate below-average. Both transformations preserve the relative ordering of observations while changing the scale, which is key for fair feature comparison in data science workflows.
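A four-value example makes both formulas concrete. The salary values below are made up for illustration and are not taken from the simulated dataset.

```r
# Illustrative salaries (assumed values, not from company_data)
salary <- c(3000, 6000, 9000, 15000)

# Min-max normalization: (x - min) / (max - min)
min_max <- (salary - min(salary)) / (max(salary) - min(salary))

# Z-score standardization: (x - mean) / sd, using the sample standard deviation
z <- (salary - mean(salary)) / sd(salary)

data.frame(salary, min_max, z = round(z, 3))

# Both transformations keep the observations in the same order
identical(order(salary), order(min_max))  # TRUE
identical(order(salary), order(z))        # TRUE
```

The mean salary here is 8,250, so the first two values receive negative z-scores and the last two positive ones, while the min-max values run from 0 to 1.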

6.2 Visualization

Visualization · Salary Distribution Histogram

Interpretation: The interactive histogram allows switching between three views of the salary distribution: Original, Min-Max Normalized, and Z-Score. The Original histogram reveals the underlying shape of the salary data, which is piecewise uniform and right-skewed because salaries are drawn from three ranges selected by performance score: 3,000–6,000 (score < 70, about 40% of employees), 6,000–9,000 (70 to 85, about 30%), and 9,000–15,000 (≥ 85, about 30% spread over a range twice as wide). The Min-Max Normalized histogram preserves the same shape but rescales the x-axis to [0, 1], confirming that normalization is a linear transformation. The Z-Score histogram is likewise a linear rescaling: it centers the distribution at 0 with spread measured in standard deviations, making it easy to see how many employees fall within one or two standard deviations of the mean salary.

Visualization · Boxplot All Variables × Transformation

Interpretation: The grouped boxplot compares all three numeric variables (salary, performance_score, and KPI_score) side by side across the three transformation types. For the Original data, the salary box spans a wide range (3,000–15,000), dwarfing the performance and KPI score boxes, which sit in the 50–100 and 55–100 ranges respectively. After Min-Max Normalization, all three variables are rescaled to [0, 1], making them directly comparable in width and position. After Z-Score Standardization, every variable has mean 0 and standard deviation exactly 1, so no variable has higher variance than another. Any remaining differences between the boxes (for example, in IQR or median position) reflect distribution shape: salary is right-skewed, while the two scores are roughly uniform. This chart demonstrates why feature scaling matters before applying machine learning algorithms that are sensitive to feature magnitude.
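The effect of standardization on spread can be checked on synthetic data. This sketch builds a toy data frame whose columns mimic the simulated variables (the sizes and the 40/30/30 salary mix are assumptions) and uses base R’s `scale()` as a shortcut in place of the document’s `z_score()` function.

```r
set.seed(2024)
n <- 1000

# Toy columns mimicking the simulation: skewed salary, uniform scores
toy <- data.frame(
  salary            = c(runif(400, 3000, 6000),
                        runif(300, 6000, 9000),
                        runif(300, 9000, 15000)),
  performance_score = runif(n, 50, 100),
  KPI_score         = runif(n, 55, 100)
)

# scale() centers each column at 0 and divides by its sample standard deviation
z_toy <- as.data.frame(scale(toy))

round(sapply(z_toy, sd), 6)    # standard deviation of each standardized column
round(sapply(z_toy, IQR), 2)   # IQR in standard-deviation units
```

The standard deviations all come out as 1, whereas the IQRs can differ between columns because the IQR depends on the shape of the distribution as well as its scale.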

7 Mini Project: Company KPI Dashboard & Simulation

Task: Generate a dataset for 5–10 companies with 50–200 employees each. Summarise per company, loop to categorize KPI tiers, and produce advanced visualisations.

Columns: employee_id, company_id, salary, performance_score, KPI_score, department
Loop to categorize employees into KPI tiers
Grouped bar charts, scatter plots with regression lines

7.1 Function

# Set the seed for reproducibility to ensure the random generation remains consistent
set.seed(777)

# Generate simulated company data for 6 companies with 30 employees each using the predefined function
kpi_data  <- generate_company_data(n_company = 6, n_employees = 30)

# Initialize an empty vector to store the KPI tier for each employee
kpi_tier  <- c()

# Loop through every row of the generated kpi_data data frame
# (seq_len() is safe even if the data frame has zero rows)
for (i in seq_len(nrow(kpi_data))) {
  # Extract the KPI score for the current employee (row)
  score    <- kpi_data$KPI_score[i]
  # Assign a tier label based on the KPI score thresholds
  tier     <- if (score >= 90) "Platinum" else if (score >= 75) "Gold"
              else if (score >= 60) "Silver" else "Bronze"
  # Append the determined tier to the kpi_tier vector
  kpi_tier <- c(kpi_tier, tier)}

# Add the tier vector to the data frame as a factor with a specific ordinal level
kpi_data$kpi_tier <- factor(kpi_tier, levels = c("Platinum","Gold","Silver","Bronze"))

# Create a summary table by grouping the data by company
kpi_summary <- kpi_data %>%
  # Group the rows by the company_id column
  group_by(company_id) %>%
  # Calculate summary metrics for each company
  summarise(
    # Calculate the average salary rounded to the nearest whole number
    Avg_Salary      = round(mean(salary), 0),
    # Calculate the average KPI score rounded to 2 decimal places
    Avg_KPI         = round(mean(KPI_score), 2),
    # Calculate the average performance score rounded to 2 decimal places
    Avg_Performance = round(mean(performance_score), 2),
    # Count the number of individuals flagged as "Yes" in the top_performer column
    Top_Performers  = sum(top_performer == "Yes"),
    # Count the number of individuals who achieved the "Platinum" KPI tier
    Platinum_Count  = sum(kpi_tier == "Platinum"),
    # Count the total number of employees in the company
    Total_Employees = n(),
    # Remove the grouping structure after the summary is complete
    .groups = "drop")

Result · KPI Dashboard Summary per Company

Interpretation: The KPI Dashboard Summary table consolidates workforce metrics across 6 companies (C1–C6), each with 30 employees; this is fewer than the 50–200 employees per company stated in the task. Key columns include Avg_Salary (mean compensation), Avg_KPI and Avg_Performance (overall workforce quality indicators), Top_Performers (count of employees with KPI > 90), Platinum_Count (employees in the highest KPI tier, KPI ≥ 90), and Total_Employees. Top_Performers and Platinum_Count differ only for employees whose KPI score is exactly 90.0, so the two columns are nearly always equal. Companies with a higher Platinum_Count tend to have a stronger high-achieving workforce. Cross-referencing Avg_Salary with Avg_KPI shows whether higher-paid companies also deliver better KPI outcomes, a useful comparison for HR decision-making and resource allocation.

7.2 Visualization

Visualization · KPI Tier Distribution per Company

Interpretation: The grouped bar chart shows how employees are distributed across four KPI tiers, Platinum (≥90), Gold (≥75), Silver (≥60), and Bronze (<60), for each of the six companies. Since KPI scores are uniformly distributed between 55 and 100, the expected shares follow the tier widths: about 33% each for Gold and Silver (15 points wide), about 22% for Platinum (10 points), and about 11% for Bronze (5 points). With 30 employees per company, that corresponds to roughly 10 Gold, 10 Silver, 7 Platinum, and 3 Bronze. A company with a noticeably taller Platinum bar has more top scorers than expected, while a taller Bronze bar may indicate underperformance relative to peers; in this simulation both arise by chance. Comparing the tier mix across C1–C6 allows management to benchmark workforce quality and identify which companies may need talent development investment.
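The tier assignment and its expected shares can be sketched without an explicit loop. `assign_tier()` is a hypothetical helper that uses `cut()` with left-closed intervals to reproduce the `>=` thresholds of the loop-based version; it is shown as an alternative, not as the method used in the report.

```r
tier_levels <- c("Bronze", "Silver", "Gold", "Platinum")

# Hypothetical vectorized tiering: right = FALSE gives intervals [60, 75), [75, 90), ...
assign_tier <- function(score) {
  cut(score, breaks = c(-Inf, 60, 75, 90, Inf),
      labels = tier_levels, right = FALSE)
}

# Scores placed on and just below each threshold
as.character(assign_tier(c(55, 59.9, 60, 74.9, 75, 89.9, 90, 100)))

# Expected share of each tier for Uniform(55, 100) scores (rounding ignored)
expected_share <- diff(punif(c(55, 60, 75, 90, 100), min = 55, max = 100))
names(expected_share) <- tier_levels
round(100 * expected_share, 1)
round(30 * expected_share, 1)   # expected employees per tier in a 30-person company
```

A score of exactly 60, 75, or 90 lands in the higher tier, matching the `>=` comparisons in the original loop.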

Visualization · Average KPI Score by Department

Interpretation: The horizontal bar chart ranks all six departments by their average KPI score across all companies. The color gradient (gold to navy) provides an additional visual cue — darker bars represent higher-performing departments. Since department assignments were made randomly in the simulation, any apparent differences reflect random variation rather than real-world domain effects. However, in a real organizational context, this chart would pinpoint which departments consistently deliver stronger KPI outcomes. Departments at the bottom of the chart would be candidates for performance improvement programs or resource reallocation, while top-ranked departments could serve as benchmarks for best practices.

Visualization · Performance vs KPI Score with Regression

Interpretation: The scatter plot with regression lines explores the relationship between Performance Score (x-axis) and KPI Score (y-axis) for each of the six companies. Each company’s data points are color-coded, with a fitted linear regression line overlaid to show the trend direction. A positive slope indicates that employees with higher performance scores also tend to have higher KPI scores. In this simulation, however, performance score is generated independently of KPI score, so the expected slope is zero: lines that tilt slightly up or down reflect sampling noise in 30 points per company rather than a real relationship, and upward slopes should not be read as validating the simulation. A steeper line means a larger change in KPI per performance point; how tightly the points cluster around the line (the correlation) is what measures the strength of the relationship. In a real dataset, a flat line for one company would be a counterintuitive result worth investigating.
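To see how much the per-company slopes can wander when the two scores are unrelated, here is a minimal base-R sketch. It does not reuse the practicum's data: the seed, the 50–100 range for performance scores, and the name `fit_stats` are assumptions made only for this illustration.

```r
set.seed(1)  # arbitrary seed for this sketch, not the practicum's seed

n_company   <- 6
n_employees <- 30

# Hypothetical ranges: both scores are drawn independently of each other
sim <- data.frame(
  company_id        = rep(paste0("C", 1:n_company), each = n_employees),
  performance_score = runif(n_company * n_employees, 50, 100),
  KPI_score         = runif(n_company * n_employees, 55, 100)
)

# Fit KPI ~ performance separately per company; keep the slope and correlation
fit_stats <- do.call(rbind, lapply(split(sim, sim$company_id), function(d) {
  fit <- lm(KPI_score ~ performance_score, data = d)
  data.frame(
    company = d$company_id[1],
    slope   = unname(coef(fit)["performance_score"]),
    r       = cor(d$performance_score, d$KPI_score)
  )
}))

print(fit_stats, row.names = FALSE)
```

Any non-zero slope in this output is produced purely by chance, which is the baseline against which slopes in the scatter plot should be judged.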

Visualization · Salary Distribution by KPI Tier

Interpretation: The violin plot reveals the full salary distribution shape for each KPI tier — Platinum, Gold, Silver, and Bronze. The width of the violin at any point indicates the density of employees at that salary level. The embedded boxplot shows the median (horizontal line), interquartile range (box), and the mean line. Since salary is assigned based on performance score thresholds — and performance score is independent of KPI score in this simulation — the salary distributions across tiers may overlap significantly. However, if Platinum employees (highest KPI) also tend to have higher performance scores, their salary violin would be weighted toward the upper range. This chart is particularly useful for identifying whether top-tier KPI performers are also being compensated at a level commensurate with their achievement.

8 Automated Report Generation

Task (Bonus): Use functions and loops to generate an automated summary report per company — with tables, plots, and optional CSV export.

Loop through each company and generate individual summary
Automated tables and plots per company
Optional: export each company summary to CSV

8.1 Function

# Define a function to generate a detailed report for a specific company
generate_company_report <- function(data, cid, export_csv = FALSE) {
  # Filter the main dataset to include only rows matching the given company ID
  sub <- data %>% filter(company_id == cid)
  # Create a list containing various summary statistics for the company
  report <- list(
    company      = cid,
    n_employees  = nrow(sub),
    # Calculate average salary rounded to the nearest whole number
    avg_salary   = round(mean(sub$salary), 0),
    # Calculate average KPI score rounded to 2 decimal places
    avg_kpi      = round(mean(sub$KPI_score), 2),
    # Calculate average performance score rounded to 2 decimal places
    avg_perf     = round(mean(sub$performance_score), 2),
    # Calculate the percentage of top performers in the company
    top_perf_pct = round(mean(sub$top_performer == "Yes") * 100, 1),
    # Create a frequency table of KPI tiers (e.g., Platinum, Gold, etc.)
    kpi_tiers    = table(sub$kpi_tier),
    # Create a sub-summary grouped by department
    dept_summary = sub %>%
      group_by(department) %>%
      summarise(Avg_KPI = round(mean(KPI_score),2), Avg_Salary = round(mean(salary),0),
                Count = n(), .groups = "drop") %>%
      # Sort the department summary by the highest average KPI score
      arrange(desc(Avg_KPI)))
  # If export_csv is TRUE, save the filtered company data to a physical CSV file
  if (export_csv) { write.csv(sub, file = paste0(cid, "_report.csv"), row.names = FALSE) }
  # Return the generated list object
  return(report)}

# Identify all unique company IDs present in the dataset
company_ids <- unique(kpi_data$company_id)
# Initialize an empty list to store multiple company reports
all_reports <- list()
# Loop through each unique company ID to generate its individual report
for (cid in company_ids) {
  # Call the report function and store the result in the list using the company ID as the key
  all_reports[[cid]] <- generate_company_report(kpi_data, cid, export_csv = TRUE)}

# Initialize an empty data frame to consolidate high-level report data
report_summary <- data.frame()
# Loop through the generated reports to build a summary table
for (cid in company_ids) {
  # Extract the report object for the current company ID
  r <- all_reports[[cid]]
  # Append a new row to the summary data frame with formatted values
  report_summary <- rbind(report_summary, data.frame(
    Company           = r$company,
    Employees         = r$n_employees,
    # Format salary with commas as thousands separators (e.g., 10,000)
    Avg_Salary        = formatC(r$avg_salary, format="d", big.mark=","),
    Avg_KPI           = r$avg_kpi,
    Avg_Performance   = r$avg_perf,
    # Append a percentage sign to the top performer metric
    Top_Performer_Pct = paste0(r$top_perf_pct, "%")
  ))}

# Define a function to render the summary data frame into a PDF document
render_summary_to_pdf <- function(df) {
  # Define the filename for the temporary RMarkdown file
  temp_rmd <- "temp_summary.Rmd"
  
  # Generate the content of the temporary RMarkdown file with YAML header and code chunks
  writeLines(c(
    "---",
    "title: 'Executive Summary Report'",
    "author: 'Wulan Gustika Antasya Tumanggor'",
    "output: pdf_document",
    "params:",
    "  data: !r data.frame()", # Define a parameter to accept a data frame from the main environment
    "---",
    "",
    "```{r, echo=FALSE}",
    "# Use the data passed through params to create a formatted table",
    "knitr::kable(params$data, caption = 'Company Performance Summary')",
    "```"), temp_rmd)
  
  # Use the rmarkdown package to compile the temporary file into the final PDF
  rmarkdown::render(
    input       = temp_rmd,              # Source file
    params      = list(data = df),       # Pass the summary data frame into the Rmd parameters
    output_file = "Final_Summary_Report.pdf", # Name of the generated PDF file
    envir       = new.env(),             # Run in a clean environment to avoid variable conflicts
    quiet       = TRUE)                   # Suppress processing logs for a cleaner output
  
  # Clean up by removing the temporary RMarkdown file after the PDF is successfully created
  if (file.exists(temp_rmd)) {
    file.remove(temp_rmd)}}

# Execute the function to generate the final automated PDF report
render_summary_to_pdf(report_summary)

Result · Company Report Summary

Interpretation: The automated company report summary table is the final consolidated output generated by looping through all six companies and calling `generate_company_report()` for each. It presents Employees (30 per company), Avg_Salary, Avg_KPI, Avg_Performance, and Top_Performer_Pct (percentage of employees with KPI > 90). The table is designed to function as an executive dashboard: a single row per company makes cross-company comparison immediate. The Top_Performer_Pct column is especially valuable because it compares companies on a proportional basis. With identical headcounts, as here, ranking by percentage and by count gives the same order; the percentage becomes essential once company sizes differ. The table is also programmatically exported to a PDF, demonstrating an end-to-end automated reporting pipeline.
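The practicum builds this table by growing a data frame with `rbind()` inside a `for` loop, which fits the task's focus on loops. As an alternative sketch, the same table can be assembled with `lapply()` and a single `rbind()` call. The `all_reports` list below is a toy stand-in with made-up values, not output from the practicum's dataset.

```r
# Toy stand-in for `all_reports`: values are invented for illustration only
all_reports <- list(
  C1 = list(company = "C1", n_employees = 30, avg_salary = 8250,
            avg_kpi = 77.10, avg_perf = 74.90, top_perf_pct = 23.3),
  C2 = list(company = "C2", n_employees = 30, avg_salary = 7900,
            avg_kpi = 78.42, avg_perf = 76.05, top_perf_pct = 20.0)
)

# Build one single-row data frame per report ...
rows <- lapply(all_reports, function(r) {
  data.frame(
    Company           = r$company,
    Employees         = r$n_employees,
    Avg_Salary        = formatC(r$avg_salary, format = "d", big.mark = ","),
    Avg_KPI           = r$avg_kpi,
    Avg_Performance   = r$avg_perf,
    Top_Performer_Pct = paste0(r$top_perf_pct, "%")
  )
})

# ... then bind them together in one call instead of once per iteration
report_summary <- do.call(rbind, rows)
rownames(report_summary) <- NULL

print(report_summary)
```

For six companies the difference is negligible; the single-bind form mainly pays off when the number of reports grows, because it avoids copying the accumulated data frame on every iteration.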

8.2 Visualization

Visualization · Average KPI per Company

Interpretation: This bar chart visualizes the Average KPI Score per company as generated by the automated reporting loop. Each bar’s height represents how well that company’s workforce performs on average against the KPI metric. Since this dataset was produced with `set.seed(777)`, the values are reproducible: rerunning the code yields the same chart. Companies with a taller bar have a higher average KPI. Small differences between companies (e.g., a 1–2 point gap in Avg KPI) are within the range expected from sampling noise when each average is based on only 30 simulated employees, so they should not be read as real differences here. In a real-world context, by contrast, even small KPI differentials can translate into significant business outcomes. This chart is part of the automated executive summary and should be read alongside the top performer percentage for a complete performance picture.
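A rough sense of how large a gap between two company averages can arise by chance comes from the standard error of a mean. The sketch below assumes KPI scores are uniform on [55, 100] and that each company has 30 employees, as stated elsewhere in this report; the variable names are illustrative.

```r
# Assumption: KPI ~ Uniform(55, 100) and 30 employees per company
a <- 55
b <- 100
n <- 30

kpi_mean <- (a + b) / 2          # expected average KPI
kpi_sd   <- (b - a) / sqrt(12)   # standard deviation of a uniform distribution
se_mean  <- kpi_sd / sqrt(n)     # standard error of one company's average
se_diff  <- sqrt(2) * se_mean    # standard error of the gap between two companies

print(round(c(mean = kpi_mean, sd = kpi_sd,
              se_mean = se_mean, se_diff = se_diff), 2))
```

Under these assumptions the standard error of the gap between two companies is a little over 3 KPI points, so a 1–2 point difference is smaller than one standard error.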

Visualization · Top Performer % per Company

Interpretation: The Top Performer Percentage bar chart shows what proportion of each company’s workforce achieves a KPI score above 90 (the top performer threshold). This percentage-based view normalizes for company size, which keeps comparisons fair when headcounts differ. A company with a higher bar has a deeper bench of high achievers relative to its total workforce; with equal headcounts, as here, it also has more top performers in absolute terms. In a real business context, a company that consistently maintains 20–30% top performers would likely be considered highly competitive. Comparing this chart with the Avg KPI chart adds another layer of insight: a company might have a modest average KPI but a high top-performer percentage (a group of standout individuals alongside many low scorers), or a high average but few top performers (a consistently solid workforce with few exceptional results). Together, these two final charts complete the automated reporting pipeline built across all eight tasks of this practicum.
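The two patterns contrasted above can be made concrete with a toy example. The scores below are invented for illustration, and `summarise_kpi()` is a hypothetical helper, not a function from the practicum.

```r
# Toy KPI scores, invented for illustration (10 employees per company)
kpi_a <- c(rep(95, 3), rep(60, 7))  # a few standouts, many low scorers
kpi_b <- rep(85, 10)                # uniformly solid, nobody above 90

# Hypothetical helper: average KPI and share of employees above the threshold
summarise_kpi <- function(kpi, threshold = 90) {
  c(avg_kpi      = mean(kpi),
    top_perf_pct = mean(kpi > threshold) * 100)
}

comparison <- rbind(A = summarise_kpi(kpi_a), B = summarise_kpi(kpi_b))
print(comparison)
```

Company A has the lower average but 30% top performers, while company B has the higher average and none, which is why the two charts are best read together.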

· Functions & Loops · Data Science Programming · ITSB · April 2026

Advanced Practicum: Functions & Loops

Wulan Gustika A. T.

Bakti Siregar, M.Sc., CDS

Institut Teknologi Sains Bandung
