FUNCTIONS & LOOPS

(Assignment Week-5)


Safina Zahra (52250033)

Student Majoring in Data Science



Introduction

Programming is a crucial instrument in the field of data science, enabling practitioners to perform complex data manipulation and achieve operational efficiency through automation. In developing program logic, a deep understanding of functions and loops serves as an essential foundation. Functions allow code to be written modularly and reused effectively, while loops provide the ability for a program to execute repetitive tasks quickly and accurately.

Mastering these two concepts improves code efficiency and sharpens the computational thinking needed to solve real-world data problems. I would like to express my sincere appreciation to Mr. Bakti Siregar, M.Sc., CDS., the Data Science Programming lecturer, for the guidance and insights that helped me understand these programming concepts thoroughly.

1. Dynamic Multi-Formula Function

This script defines the compute_formula function. It validates inputs to ensure only supported mathematical models are processed and uses nested loops to iterate through values of \(x\) and the requested formula types.

Implementation

# --- Load Necessary Libraries ---
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.5.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tidyr)

#' @title Dynamic Multi-Formula Function
#' @description Computes and validates linear, quadratic, cubic, and exponential models.
#' @param x_range A numeric vector (e.g., 1:20)
#' @param formulas A character vector of requested models
compute_formula <- function(x_range, formulas) {
  
  # --- 1. Formula Validation ---
  # Define supported models
  valid_list <- c("linear", "quadratic", "cubic", "exponential")
  
  # Check for unsupported inputs
  invalid_found <- formulas[!(formulas %in% valid_list)]
  
  if (length(invalid_found) > 0) {
    stop(paste("Validation Error: The following formulas are not supported:", 
               paste(invalid_found, collapse = ", ")))
  }
  
  # --- 2. Computation Logic (Nested Loops) ---
  final_data <- data.frame()
  
  # Outer Loop: Iterates through each specific formula type requested
  for (f in formulas) {
    
    # Initialize a vector to store results for the current formula
    y_results <- numeric(length(x_range))
    
    # Inner Loop: Iterates through each value in the domain (x)
    for (i in seq_along(x_range)) {
      current_x <- x_range[i]
      
      # Calculate result based on the formula logic
      y_results[i] <- switch(f,
        "linear"      = 3 * current_x + 15,         # y = mx + c
        "quadratic"   = (current_x^2) + 5,          # y = ax^2 + c
        "cubic"       = (0.05 * current_x^3) + 2,   # y = ax^3 + c
        "exponential" = 1.4^current_x               # y = a^x
      )
    }
    
    # Structure the results into a tidy data frame
    temp_df <- data.frame(
      x = x_range, 
      y = round(y_results, 2), 
      formula_type = f
    )
    
    # Append the results of the current formula to the main dataset
    final_data <- rbind(final_data, temp_df)
  }
  
  return(final_data)
}

# --- 3. Execution ---
# Define domain as per requirements (x = 1 to 20)
x_values <- 1:20
selected_models <- c("linear", "quadratic", "cubic", "exponential")

# Run the function
results_df <- compute_formula(x_values, selected_models)

# Print the first few rows to verify the calculation
print(head(results_df, 10))
##     x  y formula_type
## 1   1 18       linear
## 2   2 21       linear
## 3   3 24       linear
## 4   4 27       linear
## 5   5 30       linear
## 6   6 33       linear
## 7   7 36       linear
## 8   8 39       linear
## 9   9 42       linear
## 10 10 45       linear
Index   X Value   Y Result   Formula Type
1       1         18.00      linear
2       2         21.00      linear
3       3         24.00      linear
4       4         27.00      linear
5       5         30.00      linear
6       6         33.00      linear
7       7         36.00      linear
8       8         39.00      linear
9       9         42.00      linear
10      10        45.00      linear

Visualization

# --- High-End Interactive Visualization ---
highchart() %>%
  hc_chart(type = "line", backgroundColor = "#FAFAFA", zoomType = "xy") %>%
  hc_title(text = "<b>Mathematical Growth Model Comparison</b>", 
           style = list(fontSize = "24px", color = "#2c3e50", fontFamily = "Helvetica")) %>%
  hc_subtitle(text = "Visualizing Linear, Polynomial, and Exponential Trends (x = 1:20)") %>%
  hc_xAxis(title = list(text = "Input Range (x)"), gridLineWidth = 1) %>%
  hc_yAxis(title = list(text = "Output Value (y)"), 
           gridLineDashStyle = "Dash",
           labels = list(format = "{value}")) %>%
  hc_colors(c("#1abc9c", "#3498db", "#9b59b6", "#e74c3c")) %>% 
  hc_add_series(results_df, "line", hcaes(x = x, y = y, group = formula_type)) %>%
  hc_plotOptions(series = list(
    marker = list(enabled = TRUE, symbol = "circle", radius = 4),
    lineWidth = 4,
    animation = list(duration = 2000)
  )) %>%
  hc_tooltip(shared = TRUE, crosshairs = TRUE, pointFormat = "<b>{series.name}:</b> {point.y}<br/>") %>%
  hc_legend(align = "center", verticalAlign = "bottom", layout = "horizontal") %>%
  hc_exporting(enabled = TRUE)

Interpretation

The four models follow distinct growth trajectories. The Linear Model (\(y = 3x + 15\)) has a constant rate of change and plots as a straight line, representing predictable and steady growth. The Quadratic (\(y = x^2 + 5\)) and Cubic (\(y = 0.05x^3 + 2\)) Models represent polynomial growth. A cubic term eventually grows faster than a quadratic one, but the small coefficient 0.05 keeps the cubic curve below the quadratic curve across the whole domain \(x = 1\) to \(20\), with the gap almost closed at \(x = 20\) (402 versus 405). The Exponential Model (\(y = 1.4^x\)) starts slowly but overtakes every other model from \(x = 17\) onward and reaches about 836.68 at \(x = 20\), because its rate of change is proportional to its current value. From a programming perspective, the nested loops keep the logic explicit: the outer loop iterates over formula types, the inner loop over individual values of \(x\), and adding a model only requires a new entry in valid_list and in the switch statement. The structure is easy to read and extend, although in R a vectorized calculation would normally be faster than element-by-element loops combined with repeated rbind calls.
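
The comparison between the curves can be checked numerically. The sketch below is a vectorized alternative to the nested loops, not the implementation used in this assignment; it reuses the same coefficients, and the names formula_fns and compare_models are hypothetical.

```r
# Same four formulas as compute_formula, stored as vectorized functions.
# `formula_fns` and `compare_models` are illustrative names only.
formula_fns <- list(
  linear      = function(x) 3 * x + 15,
  quadratic   = function(x) x^2 + 5,
  cubic       = function(x) 0.05 * x^3 + 2,
  exponential = function(x) 1.4^x
)

compare_models <- function(x_range, fns = formula_fns) {
  # sapply returns a matrix: one row per x value, one column per formula
  values <- sapply(fns, function(f) f(x_range))
  data.frame(x = x_range, round(values, 2))
}

wide <- compare_models(1:20)

# First x at which the exponential curve exceeds every other model
others_max <- pmax(wide$linear, wide$quadratic, wide$cubic)
first_lead <- min(wide$x[wide$exponential > others_max])
print(first_lead)

# Values of all four models at the end of the domain
print(wide[wide$x == 20, ])
```

With the coefficients above, first_lead evaluates to 17, and the last row shows the quadratic model (405) still marginally ahead of the cubic model (402) at \(x = 20\).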

2. Nested Simulation: Multi-Sales & Discounts

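This function, simulate_sales, combines a loop with a nested helper function: the outer loop iterates through individual salespersons, and within each iteration the daily sales amounts and their conditional discounts are generated with vectorized calls (runif and nested ifelse), while the inner function calculate_cumulative computes the running total of net sales.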

Implementation

# --- Load Necessary Libraries ---
library(dplyr)
library(highcharter)
library(tidyr)

#' @title Nested Sales Simulation
#' @description Simulates daily sales for multiple agents and applies conditional discounts.
simulate_sales <- function(n_salesperson, days) {
  
  # --- Inner Nested Function: Cumulative Logic ---
  calculate_cumulative <- function(sales_vector) {
    return(cumsum(sales_vector))
  }
  
  full_dataset <- data.frame()
  
  # --- Outer Loop: Per Salesperson ---
  for (s_id in 1:n_salesperson) {
    
    # Generate random daily sales amounts
    daily_amounts <- round(runif(days, min = 100, max = 1000), 2)
    
    # --- Conditional Discount Logic ---
    discounts <- ifelse(daily_amounts > 800, 0.15,
                  ifelse(daily_amounts > 500, 0.10, 0))
    
    # Create temporary dataframe
    temp_df <- data.frame(
      sales_id = paste("Agent", s_id),
      day = 1:days,
      sales_amount = daily_amounts,
      discount_rate = discounts,
      net_sales = round(daily_amounts * (1 - discounts), 2)
    )
    
    # Calculate Cumulative Sales using the nested function
    temp_df$cumulative_sales <- calculate_cumulative(temp_df$net_sales)
    full_dataset <- rbind(full_dataset, temp_df)
  }
  
  return(full_dataset)
}

# --- Execution ---
set.seed(123) 
simulation_data <- simulate_sales(n_salesperson = 5, days = 14)

# --- Summary Statistics ---
summary_stats <- simulation_data %>%
  group_by(sales_id) %>%
  summarise(
    Total_Revenue = sum(net_sales),
    Avg_Daily_Sale = round(mean(net_sales), 2),
    Max_Single_Day = max(net_sales)
  )

# Output summary to console
print(summary_stats)
## # A tibble: 5 × 4
##   sales_id Total_Revenue Avg_Daily_Sale Max_Single_Day
##   <chr>            <dbl>          <dbl>          <dbl>
## 1 Agent 1          7970.           569.           817.
## 2 Agent 2          7894.           564.           846.
## 3 Agent 3          6527.           466.           822.
## 4 Agent 4          5600.           400.           741.
## 5 Agent 5          7490.           535.           770.
Sales Agent   Total Revenue ($)   Avg Daily Sale ($)   Max Single Day ($)
Agent 1       7,970               569                  817
Agent 2       7,894               564                  846
Agent 3       6,527               466                  822
Agent 4       5,600               400                  741
Agent 5       7,490               535                  770

Visualization

# --- Cumulative Sales Growth Plot ---
highchart() %>%
  hc_chart(type = "line", backgroundColor = "#F9F9F9") %>%
  hc_title(text = "<b>Salesperson Performance: Cumulative Growth</b>",
           style = list(fontFamily = "Inter", fontSize = "22px")) %>%
  hc_subtitle(text = "Tracking Net Sales over 14 Days (After Discounts)") %>%
  hc_xAxis(title = list(text = "Day Number")) %>%
  hc_yAxis(title = list(text = "Total Cumulative Sales ($)"), gridLineDashStyle = "Dot") %>%
  hc_colors(c("#16a085", "#2980b9", "#8e44ad", "#f39c12", "#c0392b")) %>%
  hc_add_series(simulation_data, "line", hcaes(x = day, y = cumulative_sales, group = sales_id)) %>%
  hc_plotOptions(series = list(marker = list(enabled = TRUE, radius = 4), lineWidth = 3)) %>%
  hc_tooltip(shared = TRUE, crosshairs = TRUE) %>%
  hc_legend(align = "center", verticalAlign = "bottom", layout = "horizontal")

Interpretation

The simulation demonstrates how a nested function allows localized data processing within a larger loop, keeping the code modular and readable. The Cumulative Trajectory shows which agents maintain consistent performance and which rely on volatile high-value spikes. The Conditional Discounting logic simulates a tiered business incentive: daily amounts above 800 receive a 15% discount, amounts above 500 receive 10%, and smaller amounts receive none, so larger sales are rewarded with a lower net price for the customer while the net revenue retained by the organization is tracked.
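
To make the tiers concrete, the sketch below applies the same discount rule to a few fixed amounts so the result can be checked by hand. The helper name apply_discount is hypothetical and is not part of simulate_sales.

```r
# Tiered discount rule from simulate_sales, applied to fixed amounts.
# `apply_discount` is an illustrative helper, not part of the assignment code.
apply_discount <- function(amount) {
  rate <- ifelse(amount > 800, 0.15,
           ifelse(amount > 500, 0.10, 0))
  data.frame(
    sales_amount  = amount,
    discount_rate = rate,
    net_sales     = round(amount * (1 - rate), 2)
  )
}

example <- apply_discount(c(400, 500, 650, 800, 900))

# Running total of net sales, as calculate_cumulative does per agent
example$cumulative_sales <- cumsum(example$net_sales)
print(example)
```

Because both comparisons are strict, the boundary amounts 500 and 800 fall into the lower tier: 500 receives no discount and 800 receives 10%.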

3. Multi-Level Performance Categorization

The function categorize_performance uses an explicit loop over a numeric vector, evaluating a chain of nested ifelse conditions for each element to assign labels based on predefined financial thresholds (800, 600, 400, and 200).

Implementation & Visualization

#' @title Performance Categorization Function
#' @description Categorizes sales and calculates percentage distribution.
categorize_performance <- function(sales_amounts) {
  
  categories <- character(length(sales_amounts))
  
  # 1. Loop through vector to categorize
  for (i in seq_along(sales_amounts)) {
    val <- sales_amounts[i]
    categories[i] <- ifelse(val >= 800, "Excellent",
                      ifelse(val >= 600, "Very Good",
                        ifelse(val >= 400, "Good",
                          ifelse(val >= 200, "Average", "Poor"))))
  }
  
  # 2. Calculate distribution
  summary_df <- data.frame(Category = categories) %>%
    group_by(Category) %>%
    summarise(Count = n()) %>%
    mutate(Percentage = round((Count / sum(Count)) * 100, 2))
  
  return(summary_df)
}

# Execution with Sample Data
set.seed(456)
sample_sales <- runif(50, 50, 1000)
perf_summary <- categorize_performance(sample_sales)

# Visual 1: Bar Chart (Frequency)
hchart(perf_summary, "column", hcaes(x = Category, y = Count), name = "Staff Count") %>%
  hc_title(text = "<b>Performance Distribution (Bar)</b>") %>%
  hc_colors("#2ecc71")
# Visual 2: Pie Chart (Proportion)
hchart(perf_summary, "pie", hcaes(x = Category, y = Percentage), name = "Percentage") %>%
  hc_title(text = "<b>Staff Contribution (%)</b>") %>%
  hc_plotOptions(pie = list(dataLabels = list(enabled = TRUE, format = "{point.name}: {point.y}%")))

Interpretation

The distribution analysis provides an immediate health check of sales operations. The Bar Plot identifies the most common performance tier (the mode), while the Pie Chart highlights whether the organization is “top-heavy” (mostly Excellent) or struggling (mostly Poor). This logic is essential for HR management and setting future sales targets.
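
For comparison, the same tiers can be assigned without an explicit loop. The sketch below uses base R's cut(); the function name categorize_vectorized is hypothetical, and the thresholds are copied from categorize_performance.

```r
# Vectorized alternative to the element-by-element loop, using base R's cut().
# `categorize_vectorized` is an illustrative name; thresholds match the original.
categorize_vectorized <- function(sales_amounts) {
  cut(
    sales_amounts,
    breaks = c(-Inf, 200, 400, 600, 800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE  # intervals are [lower, upper), so 800 counts as "Excellent"
  )
}

tiers <- categorize_vectorized(c(50, 200, 399.99, 600, 800, 950))
print(as.character(tiers))

# Percentage share per tier; the factor keeps the tiers in ranked order
print(round(prop.table(table(tiers)) * 100, 2))
```

Returning a factor has one practical side effect: tiers with no observations still appear in the table with a count of zero, and charts built from it keep the ranked order instead of sorting alphabetically.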

4. Multi-Company Dataset Simulation

The function generate_company_data is designed to simulate large-scale organizational data. It utilizes nested loops: the outer loop handles the creation of companies, while the inner loop generates randomized metrics for each individual employee.

Implementation

# --- Libraries ---
library(dplyr)
library(highcharter)
library(purrr)
library(tidyr)

# --- Simulation Function ---
# Generates a synthetic dataset for multiple companies and their employees
generate_company_data <- function(n_company, n_employees) {
  
  full_dataset <- data.frame()
  departments <- c("Data Science", "Engineering", "Marketing", "Finance", "Operations")
  
  # Outer Loop: Iterate per Company
  for (c_id in 1:n_company) {
    company_name <- paste("Company", LETTERS[c_id])
    
    # Inner Loop: Iterate per Employee
    for (e_id in 1:n_employees) {
      
      # Metrics Generation
      salary <- round(rnorm(1, mean = 5000, sd = 1200), 2)
      perf_score <- round(runif(1, min = 50, max = 100), 1)
      
      # Conditional Logic: High performers (Score > 85) get higher KPI distributions
      if (perf_score > 85) {
        kpi_base <- rnorm(1, mean = 92, sd = 4)
      } else {
        kpi_base <- rnorm(1, mean = 75, sd = 10)
      }
      
      kpi_score <- round(min(kpi_base, 100), 1) # Cap at 100
      
      # KPI Threshold Labeling
      status <- ifelse(kpi_score > 90, "Top Performer", "Standard")
      
      # Row Construction
      row <- data.frame(
        company_id = company_name,
        employee_id = paste0("EMP-", c_id, "-", sprintf("%03d", e_id)),
        department = sample(departments, 1),
        salary = salary,
        performance_score = perf_score,
        kpi_score = kpi_score,
        performance_tier = status
      )
      
      full_dataset <- rbind(full_dataset, row)
    }
  }
  return(full_dataset)
}

# --- Execute Simulation ---
set.seed(42) 
raw_data <- generate_company_data(n_company = 5, n_employees = 50)

# --- Create Summary Table ---
# Aggregating metrics to interpret company-level trends
company_summary <- raw_data %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = round(mean(salary), 2),
    avg_performance = round(mean(performance_score), 2),
    max_kpi = max(kpi_score),
    top_performers = sum(performance_tier == "Top Performer")
  )

# Display the summary table
print(company_summary)
## # A tibble: 5 × 5
##   company_id avg_salary avg_performance max_kpi top_performers
##   <chr>           <dbl>           <dbl>   <dbl>          <int>
## 1 Company A       4934.            73.0    98.5             10
## 2 Company B       4830.            75.6    95.9             13
## 3 Company C       4689.            73.4   100               12
## 4 Company D       4945.            77.2    97.2             16
## 5 Company E       5027.            73.1   100               16
Company ID   Avg Salary ($)   Avg Performance   Max KPI   Top Performers
Company A    4,934            73.0              98.5      10
Company B    4,830            75.6              95.9      13
Company C    4,689            73.4              100.0     12
Company D    4,945            77.2              97.2      16
Company E    5,027            73.1              100.0     16

Visualization

highchart() %>%
  hc_chart(type = "column", backgroundColor = "#FCFCFC") %>%
  hc_title(text = "<b>Company Performance & Salary Insights</b>", 
           style = list(color = "#2c3e50", useHTML = TRUE)) %>%
  hc_subtitle(text = "Aggregated simulation of salary vs performance metrics") %>%
  hc_xAxis(categories = company_summary$company_id) %>%
  hc_yAxis_multiples(
    list(title = list(text = "Average Salary ($)"), opposite = FALSE),
    list(title = list(text = "Avg Performance Score"), opposite = TRUE)
  ) %>%
  hc_add_series(name = "Avg Salary", data = company_summary$avg_salary, yAxis = 0, color = "#1abc9c") %>%
  hc_add_series(name = "Avg Performance", data = company_summary$avg_performance, yAxis = 1, color = "#3498db") %>%
  hc_tooltip(shared = TRUE, crosshairs = TRUE) %>%
  hc_add_theme(hc_theme_smpl())
raw_data %>%
  hchart("scatter", hcaes(x = performance_score, y = kpi_score, group = company_id)) %>%
  hc_title(text = "<b>Performance vs. KPI Correlation</b>") %>%
  hc_subtitle(text = "Distribution of employees across all simulated companies") %>%
  hc_colors(c("#1abc9c", "#3498db", "#9b59b6", "#e67e22", "#e74c3c")) %>%
  hc_xAxis(title = list(text = "General Performance Score")) %>%
  hc_yAxis(title = list(text = "Specific KPI Score")) %>%
  hc_plotOptions(scatter = list(marker = list(radius = 4, symbol = "circle")))

Interpretation

The simulation generated a dataset of 5 companies with 50 employees each (250 employees in total), using nested loops to maintain a clear organizational hierarchy. Conditional logic creates a positive association between performance and results: employees with a performance score above 85 draw their KPI from a distribution centered at 92 (standard deviation 4), while all others draw from one centered at 75 (standard deviation 10), so high performers are far more likely to achieve a KPI score greater than 90. Salaries are drawn from a normal distribution (\(\mu=5000, \sigma=1200\)), independently of performance, which provides realistic variance in compensation levels across the simulated workforce.
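
The size of this effect can be approximated directly from the distribution parameters in generate_company_data. The sketch below is a rough calculation that ignores the rounding to one decimal place and the cap at 100; the variable names are hypothetical.

```r
# Probability that the unrounded KPI draw exceeds 90 in each branch
p_top_high <- pnorm(90, mean = 92, sd = 4,  lower.tail = FALSE)  # performance_score > 85
p_top_low  <- pnorm(90, mean = 75, sd = 10, lower.tail = FALSE)  # performance_score <= 85

# performance_score is uniform on [50, 100], so P(score > 85) = 15 / 50
p_high <- (100 - 85) / (100 - 50)

# Law of total probability: overall share of "Top Performer" labels
p_top_overall <- p_high * p_top_high + (1 - p_high) * p_top_low

print(round(c(high = p_top_high, low = p_top_low, overall = p_top_overall), 3))
```

Under these assumptions a high performer has roughly a 69% chance of a KPI above 90, compared with about 7% for everyone else. The overall rate of about 25% implies around 12 to 13 top performers per 50 employees, which is in line with the counts of 10 to 16 in the summary table.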

5. Monte Carlo Simulation: Pi & Probability

This script defines the monte_carlo_pi function to estimate the value of \(\pi\) and analyze the spatial probability of random points. It generates all coordinate points in a single vectorized call rather than an explicit loop and applies conditional logic to classify each point.

Implementation

# --- Libraries ---
library(dplyr)
library(highcharter)

# --- Monte Carlo Pi Function ---
monte_carlo_pi <- function(n_points) {
  
  # Generate random coordinates in a 1x1 square [0,1]
  set.seed(123)
  x <- runif(n_points, min = 0, max = 1)
  y <- runif(n_points, min = 0, max = 1)
  
  # Calculate distance from origin (0,0) to check if inside the quarter circle
  distance_squared <- x^2 + y^2
  is_inside <- distance_squared <= 1
  
  # Probability analysis: Points falling in a specific sub-square (e.g., 0 < x,y < 0.5)
  is_in_sub_square <- (x <= 0.5 & y <= 0.5)
  prob_sub_square <- sum(is_in_sub_square) / n_points
  
  # Estimate Pi: (Points in Circle / Total Points) * 4
  pi_estimate <- (sum(is_inside) / n_points) * 4
  
  # Store results in a dataframe for visualization
  sim_data <- data.frame(
    x = x,
    y = y,
    status = ifelse(is_inside, "Inside Circle", "Outside Circle"),
    in_sub_square = is_in_sub_square
  )
  
  return(list(data = sim_data, pi = pi_estimate, prob_sub = prob_sub_square))
}
# --- Execute Simulation ---
# Using 5,000 points for a balance of precision and visualization performance
results <- monte_carlo_pi(n_points = 5000)

# --- Summary Statistics ---
cat("Estimated Pi Value:", results$pi, "\n")
## Estimated Pi Value: 3.1816
cat("Actual Pi Value:   ", pi, "\n")
## Actual Pi Value:    3.141593
cat("Error Margin:      ", abs(pi - results$pi), "\n")
## Error Margin:       0.04000735
cat("Prob. in Sub-Square (0.5x0.5):", results$prob_sub * 100, "%\n")
## Prob. in Sub-Square (0.5x0.5): 25.06 %

Visualization

# --- Visualization (Highcharter) ---
highchart() %>%
  hc_chart(type = "scatter", zoomType = "xy", backgroundColor = "#F9F9F9") %>%
  hc_title(text = "<b>Monte Carlo Simulation: Estimating Pi</b>", 
           style = list(color = "#2c3e50", useHTML = TRUE)) %>%
  hc_subtitle(text = paste("Total Points:", nrow(results$data), "| Estimated Pi:", results$pi)) %>%
  hc_xAxis(title = list(text = "X Coordinate"), min = 0, max = 1) %>%
  hc_yAxis(title = list(text = "Y Coordinate"), min = 0, max = 1, height = "100%", width = "100%") %>%
  hc_add_series(results$data, "scatter", hcaes(x = x, y = y, group = status),
                marker = list(radius = 2)) %>%
  hc_colors(c("#3498db", "#e74c3c")) %>% # Blue for Inside, Red for Outside
  hc_tooltip(pointFormat = "X: {point.x:.3f}<br>Y: {point.y:.3f}<br>{point.status}") %>%
  hc_plotOptions(scatter = list(states = list(hover = list(enabled = TRUE))))

Interpretation

The simulation relies on the ratio between the area of a circle (\(A = \pi r^2\)) and its circumscribed square (\(A = (2r)^2\)), which equals \(\frac{\pi}{4}\). The code samples the unit square \([0, 1]^2\) and tests each point against the quarter circle of radius 1, which gives the same ratio, so the proportion of points falling inside approximates \(\frac{\pi}{4}\) and multiplying by 4 estimates \(\pi\). The sub-square analysis yields 25.06%, close to the geometric probability of 25% (\(0.5 \times 0.5\)), which is consistent with uniformly distributed points. With 5,000 points the estimate is 3.1816, an absolute error of about 0.04. By the Law of Large Numbers the estimate converges toward \(\pi\) as the number of points (\(n\)) increases, with the typical error shrinking in proportion to \(1/\sqrt{n}\), so each additional digit of precision requires roughly 100 times more points.
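
The convergence behaviour can be illustrated by repeating the estimate at several sample sizes. The sketch below is a minimal variant of the quarter-circle method; the helper estimate_pi is hypothetical and, unlike monte_carlo_pi, does not reset the seed inside the function.

```r
# Estimate pi from n uniform points in the unit square (quarter-circle method).
# `estimate_pi` is an illustrative helper; the seed is set once, outside it.
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(123)
sample_sizes <- c(100, 1000, 10000, 100000)
estimates <- sapply(sample_sizes, estimate_pi)

# Theoretical standard error: 4 * sqrt(p * (1 - p) / n) with p = pi / 4
std_error <- 4 * sqrt((pi / 4) * (1 - pi / 4) / sample_sizes)

print(data.frame(
  n         = sample_sizes,
  estimate  = estimates,
  abs_error = abs(estimates - pi),
  std_error = round(std_error, 4)
))
```

The absolute error of a single run does not have to decrease at every step, since each estimate is random, but the standard error column falls by a factor of about 3.16 for every tenfold increase in \(n\).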

6. Advanced Data Transformation & Feature Engineering

This script defines the normalize_columns and z_score functions to standardize datasets and generate new analytical features. It utilizes loop-based transformations to process multiple data columns efficiently and applies conditional logic to categorize data.

Implementation

# Load necessary library for interactive charts
# install.packages("highcharter")
library(highcharter)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
# Seed for reproducibility
set.seed(2026)

# Generate synthetic dataset
df <- data.frame(
  employee_id = 1:100,
  salary = rnorm(100, mean = 5500, sd = 1200),
  performance_score = runif(100, min = 10, max = 95)
)

normalize_columns <- function(df) {
  # Identify numeric columns (excluding ID)
  cols_to_fix <- names(df)[sapply(df, is.numeric) & names(df) != "employee_id"]
  
  for (col in cols_to_fix) {
    min_val <- min(df[[col]], na.rm = TRUE)
    max_val <- max(df[[col]], na.rm = TRUE)
    df[[paste0(col, "_normalized")]] <- (df[[col]] - min_val) / (max_val - min_val)
  }
  return(df)
}

z_score <- function(df) {
  cols_to_fix <- names(df)[sapply(df, is.numeric) & !grepl("_normalized|employee_id", names(df))]
  
  for (col in cols_to_fix) {
    mu <- mean(df[[col]], na.rm = TRUE)
    sigma <- sd(df[[col]], na.rm = TRUE)
    df[[paste0(col, "_zscore")]] <- (df[[col]] - mu) / sigma
  }
  return(df)
}

# Apply transformations
df <- normalize_columns(df)
df <- z_score(df)

# Feature Creation: Performance Categories and Salary Brackets
df$performance_category <- ifelse(df$performance_score >= 80, "High Performer",
                            ifelse(df$performance_score >= 50, "Average", "Below Average"))

df$salary_bracket <- cut(df$salary, 
                         breaks = quantile(df$salary, probs = c(0, 0.33, 0.66, 1)),
                         labels = c("Low", "Medium", "High"), 
                         include.lowest = TRUE)

Visualization

# 1. Prepare Histogram Data for Highcharts
# We calculate the bins for both Raw and Z-Score data
h_raw <- hist(df$salary, plot = FALSE, breaks = 15)
h_z   <- hist(df$salary_zscore, plot = FALSE, breaks = 15)

# 2. Generate the High-End Interactive Histogram
highchart() %>%
  hc_chart(backgroundColor = "#FAFAFA", zoomType = "xy") %>%
  hc_title(text = "<b>Distribution Analysis: Raw vs. Standardized</b>", 
           style = list(fontSize = "22px", color = "#2c3e50", fontFamily = "Helvetica")) %>%
  hc_subtitle(text = "Interactive Histogram showing Frequency Bins") %>%
  
  # Dual X-Axes for comparison
  hc_xAxis_multiples(
    list(title = list(text = "Raw Salary Values"), col = "#3498db", opposite = FALSE),
    list(title = list(text = "Z-Score Values"), col = "#e74c3c", opposite = TRUE)
  ) %>%
  
  hc_yAxis(title = list(text = "Frequency (Count)")) %>%
  
  # Adding Raw Salary Series (using the first X-axis)
  hc_add_series(name = "Raw Salary", 
                data = h_raw$counts, 
                type = "column", 
                xAxis = 0, 
                color = "#3498db",
                pointPadding = 0, 
                groupPadding = 0.1) %>%
  
  # Adding Z-Score Series (using the second X-axis)
  hc_add_series(name = "Z-Score Salary", 
                data = h_z$counts, 
                type = "column", 
                xAxis = 1, 
                color = "#e74c3c",
                pointPadding = 0, 
                groupPadding = 0.1) %>%
  
  hc_plotOptions(column = list(
    borderRadius = 2,
    borderWidth = 0.5,
    borderColor = "#FFFFFF",
    tooltip = list(pointFormat = "<b>Frequency:</b> {point.y}")
  )) %>%
  hc_tooltip(shared = FALSE, crosshairs = TRUE) %>%
  hc_legend(enabled = TRUE) %>%
  hc_exporting(enabled = TRUE)
hcboxplot(x = df$performance_score, var = df$salary_bracket, outliers = TRUE, color = "#1abc9c") %>%
  hc_chart(type = "column", backgroundColor = "#FFFFFF") %>%
  hc_title(text = "<b>Performance Distribution by Salary Bracket</b>") %>%
  hc_subtitle(text = "Engineered categorical feature comparison") %>%
  hc_xAxis(title = list(text = "Salary Bracket (Engineered)")) %>%
  hc_yAxis(title = list(text = "Performance Score")) %>%
  hc_plotOptions(series = list(animation = list(duration = 1500)))
## Warning in hcboxplot(x = df$performance_score, var = df$salary_bracket, : 'hcboxplot' is deprecated.
## Use 'data_to_boxplot' instead.
## See help("Deprecated")

Interpretation

The data transformation process used Min-Max Normalization to rescale features into a fixed \([0, 1]\) range, which is essential for algorithms sensitive to variable magnitude, while Z-score Standardization centered the data at a mean of zero with unit standard deviation (\(\mu = 0, \sigma = 1\)) to facilitate the identification of statistical outliers. Through feature engineering, continuous values were converted into meaningful business segments, specifically performance_category and salary_bracket, enabling categorical comparisons such as whether “High Performer” employees are concentrated in the “High” salary bracket. Finally, the histogram comparison shows that standardization leaves the shape of the distribution intact and changes only its location and scale, allowing a direct and balanced comparison between variables measured in different units, such as salary in dollars and performance score in points.
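
Both transformations can be verified on a small vector. The sketch below mirrors the formulas in normalize_columns and z_score; the helper names min_max and standardize and the sample salaries are hypothetical.

```r
# Min-max normalization and z-score standardization on a small vector.
# `min_max` and `standardize` are illustrative helpers mirroring the formulas above.
min_max <- function(v) {
  (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
}
standardize <- function(v) {
  (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
}

salary <- c(3000, 4500, 5500, 6000, 8500)  # made-up values for illustration

normalized <- min_max(salary)
z <- standardize(salary)

print(round(normalized, 3))         # smallest value maps to 0, largest to 1
print(round(c(mean(z), sd(z)), 3))  # mean 0 and standard deviation 1

# Both are linear rescalings, so ordering and relative spacing are preserved
print(cor(normalized, z))
```

One caveat for both helpers: a constant column has a range and a standard deviation of zero, so the division produces NaN and such columns need separate handling.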

7. Mini Project: Company KPI Dashboard & Simulation

This script generates a comprehensive dataset for multiple companies (5-10) and their employees (50-200 each) to build an automated performance dashboard. It utilizes nested loops and conditional logic to simulate complex organizational structures and performance metrics.

Implementation

# --- Libraries ---
library(dplyr)
library(highcharter)
library(tidyr)
library(reactable) 
## Warning: package 'reactable' was built under R version 4.5.2
# --- Data Generation Function ---
generate_mini_project_data <- function(n_companies = 7, min_emp = 50, max_emp = 200) {
  full_data <- data.frame()
  depts <- c("Engineering", "Data Science", "Sales", "HR", "Product")
  
  for (i in 1:n_companies) {
    comp_name <- paste("Company", LETTERS[i])
    n_emp <- sample(min_emp:max_emp, 1)
    
    for (j in 1:n_emp) {
      # Random generation with normal and uniform distributions
      salary <- round(rnorm(1, mean = 6000, sd = 1500), 2)
      perf_score <- round(runif(1, 40, 100), 1)
      kpi_score <- round(pmin(100, perf_score * rnorm(1, 1.02, 0.08)), 1)
      
      row <- data.frame(
        employee_id = paste0("EMP-", i, "-", j),
        company_id = comp_name,
        department = sample(depts, 1),
        salary = salary,
        performance_score = perf_score,
        KPI_score = kpi_score
      )
      full_data <- rbind(full_data, row)
    }
  }
  return(full_data)
}

set.seed(2026)
master_df <- generate_mini_project_data()
# --- Categorization Loop ---
master_df$KPI_tier <- NA

for (i in 1:nrow(master_df)) {
  score <- master_df$KPI_score[i]
  
  if (score >= 90) {
    master_df$KPI_tier[i] <- "Elite"
  } else if (score >= 75) {
    master_df$KPI_tier[i] <- "High Achiever"
  } else if (score >= 55) {
    master_df$KPI_tier[i] <- "Core"
  } else {
    master_df$KPI_tier[i] <- "Underperforming"
  }
}
# --- Summary Table Calculation ---
company_summary <- master_df %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = round(mean(salary), 2),
    avg_KPI = round(mean(KPI_score), 2),
    top_performers_count = sum(KPI_tier == "Elite"),
    total_staff = n()
  )

# Outputting the Table
reactable(
  company_summary,
  bordered = TRUE, striped = TRUE, highlight = TRUE,
  columns = list(
    avg_salary = colDef(name = "Avg Salary ($)", format = colFormat(separators = TRUE)),
    avg_KPI = colDef(name = "Avg KPI Score"),
    top_performers_count = colDef(name = "Elite Count", style = list(color = "#2ecc71", fontWeight = "bold")),
    total_staff = colDef(name = "Total Employees")
  )
)

Visualization

# Summarizing KPI by Department
dept_data <- master_df %>%
  group_by(department) %>%
  summarise(mean_kpi = mean(KPI_score))

hchart(dept_data, "column", hcaes(x = department, y = mean_kpi), name = "Average KPI") %>%
  hc_title(text = "<b>KPI Performance by Department</b>") %>%
  hc_colors("#3498db") %>%
  hc_add_theme(hc_theme_smpl())
# Creating Scatter Plot with Linear Regression Trend
hchart(master_df, "scatter", hcaes(x = performance_score, y = salary, group = company_id)) %>%
  hc_title(text = "<b>Salary Distribution vs. Performance Score</b>") %>%
  hc_subtitle(text = "Analysis of compensation fairness across corporate entities") %>%
  hc_xAxis(title = list(text = "Performance Rating")) %>%
  hc_yAxis(title = list(text = "Salary ($)")) %>%
  hc_plotOptions(scatter = list(marker = list(radius = 3))) %>%
  hc_add_series(master_df, "line", hcaes(x = performance_score, y = predict(lm(salary ~ performance_score, data = master_df))), 
                name = "Market Trend", color = "black", dashStyle = "Dash")

Interpretation

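The automated dataset generation simulated an organizational structure of 7 companies with workforce sizes drawn at random between 50 and 200 employees, so the data reflects real-world variability in company size. The loop-based categorization segmented the workforce into four KPI tiers (“Elite”, “High Achiever”, “Core”, and “Underperforming”), and the summary table reports the “Elite” count alongside each company's average salary and average KPI. Because departments are assigned at random and independently of KPI, differences between departmental averages in the bar chart reflect sampling variation rather than a built-in effect. Likewise, salary is generated independently of performance_score, so the regression trendline in the scatter plot should be close to flat, and any slope it shows is sampling noise rather than evidence of a link between performance and compensation. The dashboard nonetheless demonstrates how stochastic simulation can be used to prototype the analyses that inform corporate human capital strategies.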
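
The relationships built into the generator can be checked with a quick correlation test. The sketch below re-creates the same distributions in vectorized form rather than reusing master_df; the sample size of 1000 and all variable names are assumptions made for illustration.

```r
# Re-create the generator's distributions in vectorized form.
# salary and perf_score are drawn separately, so they should be uncorrelated;
# kpi_score is built from perf_score, so the two should be strongly correlated.
set.seed(2026)
n <- 1000  # illustrative sample size, not the size of master_df
salary     <- rnorm(n, mean = 6000, sd = 1500)
perf_score <- runif(n, min = 40, max = 100)
kpi_score  <- pmin(100, perf_score * rnorm(n, mean = 1.02, sd = 0.08))

cor_salary_perf <- cor(salary, perf_score)     # independent draws
cor_kpi_perf    <- cor(kpi_score, perf_score)  # KPI derived from performance

print(round(c(salary_vs_perf = cor_salary_perf, kpi_vs_perf = cor_kpi_perf), 3))

# Slope of the fitted trend line (salary units per performance point)
print(coef(lm(salary ~ perf_score))["perf_score"])
```

If a link between pay and performance is wanted in the simulation, it has to be written into the generator, for example by making the salary mean depend on perf_score.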