FUNCTIONS & LOOPS
(Assignment Week-5)
Safina Zahra (52250033)
Student Majoring in Data Science
Introduction
Programming is a crucial instrument in the field of data science, enabling practitioners to perform complex data manipulation and achieve operational efficiency through automation. In developing program logic, a deep understanding of functions and loops serves as an essential foundation. Functions allow code to be written modularly and reused effectively, while loops provide the ability for a program to execute repetitive tasks quickly and accurately.
Mastering these two concepts not only improves code efficiency but also hones computational thinking skills in solving real-world data problems. I would like to express my sincere appreciation and gratitude to Mr. Bakti Siregar, M.Sc., CDS., as the Data Science Programming lecturer, for providing the guidance and valuable insights necessary to understand these programming logics comprehensively.
1. Dynamic Multi-Formula Function
This script defines the compute_formula function. It validates inputs to ensure only supported mathematical models are processed and uses nested loops to iterate through values of \(x\) and the requested formula types.
Implementation
library(dplyr)
library(highcharter)
library(tidyr)
#' @title Dynamic Multi-Formula Function
#' @description Computes and validates linear, quadratic, cubic, and exponential models.
#' @param x_range A numeric vector (e.g., 1:20)
#' @param formulas A character vector of requested models
compute_formula <- function(x_range, formulas) {
# --- 1. Formula Validation ---
# Define supported models
valid_list <- c("linear", "quadratic", "cubic", "exponential")
# Check for unsupported inputs
invalid_found <- formulas[!(formulas %in% valid_list)]
if (length(invalid_found) > 0) {
stop(paste("Validation Error: The following formulas are not supported:",
paste(invalid_found, collapse = ", ")))
}
# --- 2. Computation Logic (Nested Loops) ---
final_data <- data.frame()
# Outer Loop: Iterates through each specific formula type requested
for (f in formulas) {
# Initialize a vector to store results for the current formula
y_results <- numeric(length(x_range))
# Inner Loop: Iterates through each value in the domain (x)
for (i in seq_along(x_range)) {
current_x <- x_range[i]
# Calculate result based on the formula logic
y_results[i] <- switch(f,
"linear" = 3 * current_x + 15, # y = mx + c
"quadratic" = (current_x^2) + 5, # y = ax^2 + c
"cubic" = (0.05 * current_x^3) + 2, # y = ax^3 + c
"exponential" = 1.4^current_x # y = a^x
)
}
# Structure the results into a tidy data frame
temp_df <- data.frame(
x = x_range,
y = round(y_results, 2),
formula_type = f
)
# Append the results of the current formula to the main dataset
final_data <- rbind(final_data, temp_df)
}
return(final_data)
}
# --- 3. Execution ---
# Define domain as per requirements (x = 1 to 20)
x_values <- 1:20
selected_models <- c("linear", "quadratic", "cubic", "exponential")
# Run the function
results_df <- compute_formula(x_values, selected_models)
# Print the first few rows to verify the calculation
print(head(results_df, 10))
## x y formula_type
## 1 1 18 linear
## 2 2 21 linear
## 3 3 24 linear
## 4 4 27 linear
## 5 5 30 linear
## 6 6 33 linear
## 7 7 36 linear
## 8 8 39 linear
## 9 9 42 linear
## 10 10 45 linear
| Index | X Value | Y Result | Formula Type |
|---|---|---|---|
| 1 | 1 | 18.00 | linear |
| 2 | 2 | 21.00 | linear |
| 3 | 3 | 24.00 | linear |
| 4 | 4 | 27.00 | linear |
| 5 | 5 | 30.00 | linear |
| 6 | 6 | 33.00 | linear |
| 7 | 7 | 36.00 | linear |
| 8 | 8 | 39.00 | linear |
| 9 | 9 | 42.00 | linear |
| 10 | 10 | 45.00 | linear |
Visualization
# --- High-End Interactive Visualization ---
highchart() %>%
hc_chart(type = "line", backgroundColor = "#FAFAFA", zoomType = "xy") %>%
hc_title(text = "<b>Mathematical Growth Model Comparison</b>",
style = list(fontSize = "24px", color = "#2c3e50", fontFamily = "Helvetica")) %>%
hc_subtitle(text = "Visualizing Linear, Polynomial, and Exponential Trends (x = 1:20)") %>%
hc_xAxis(title = list(text = "Input Range (x)"), gridLineWidth = 1) %>%
hc_yAxis(title = list(text = "Output Value (y)"),
gridLineDashStyle = "Dash",
labels = list(format = "{value}")) %>%
hc_colors(c("#1abc9c", "#3498db", "#9b59b6", "#e74c3c")) %>%
hc_add_series(results_df, "line", hcaes(x = x, y = y, group = formula_type)) %>%
hc_plotOptions(series = list(
marker = list(enabled = TRUE, symbol = "circle", radius = 4),
lineWidth = 4,
animation = list(duration = 2000)
)) %>%
hc_tooltip(shared = TRUE, crosshairs = TRUE, pointFormat = "<b>{series.name}:</b> {point.y}<br/>") %>%
hc_legend(align = "center", verticalAlign = "bottom", layout = "horizontal") %>%
hc_exporting(enabled = TRUE)
Interpretation
The growth patterns analyzed in this task fall into distinct trajectories based on their algebraic form. The Linear Model has a constant rate of change (slope 3), so it appears as a straight line and represents steady, predictable growth. The Quadratic and Cubic Models represent polynomial growth. A cubic term eventually grows faster than a quadratic one, but with the small coefficient used here (0.05) the cubic curve stays below the quadratic curve across the whole domain and only catches up at \(x = 20\) (402 versus 405). The Exponential Model starts slowly, yet it overtakes the linear model at \(x = 12\) and exceeds every other model from \(x = 17\) onward, reflecting a “runaway” trajectory in which the rate of change is proportional to the current value. From a programming perspective, the nested loop keeps the logic explicit: the outer loop iterates over formula types and the inner loop over individual data points, so adding another model only requires a new entry in valid_list and in the switch statement. For larger domains, a vectorized calculation would avoid the cost of the inner loop and of growing the data frame with rbind.
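The comparison between the four models can be checked numerically. The following is a minimal sketch in base R that reuses the coefficients from compute_formula; the names `models` and `first_overtake` are hypothetical and exist only for this illustration.

```r
# Same formulas as compute_formula, written as vectorized functions.
# `models` and `first_overtake` are hypothetical names for this sketch.
models <- list(
  linear      = function(x) 3 * x + 15,
  quadratic   = function(x) x^2 + 5,
  cubic       = function(x) 0.05 * x^3 + 2,
  exponential = function(x) 1.4^x
)

x <- 1:20
# One column per model, one row per x value
y <- sapply(models, function(f) f(x))

# Smallest x at which the exponential column exceeds another column
first_overtake <- function(other) {
  x[which(y[, "exponential"] > y[, other])[1]]
}

crossovers <- sapply(c("linear", "quadratic", "cubic"), first_overtake)
print(crossovers)

# Quadratic and cubic values at the end of the domain (x = 20)
print(y[20, c("quadratic", "cubic")])
```

Because every formula is applied to the whole vector at once, this variant needs no inner loop. Whether that matters for 20 points is debatable, but it scales more gracefully to longer domains.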
2. Nested Simulation: Multi-Sales & Discounts
This function, simulate_sales, loops over individual salespersons. Within each iteration, it generates the daily transactions and applies the conditional discount logic with vectorized operations, and a nested helper function, calculate_cumulative, computes the running total of net sales.
Implementation
# --- Load Necessary Libraries ---
library(dplyr)
library(highcharter)
library(tidyr)
#' @title Nested Sales Simulation
#' @description Simulates daily sales for multiple agents and applies conditional discounts.
simulate_sales <- function(n_salesperson, days) {
# --- Inner Nested Function: Cumulative Logic ---
calculate_cumulative <- function(sales_vector) {
return(cumsum(sales_vector))
}
full_dataset <- data.frame()
# --- Outer Loop: Per Salesperson ---
for (s_id in 1:n_salesperson) {
# Generate random daily sales amounts
daily_amounts <- round(runif(days, min = 100, max = 1000), 2)
# --- Conditional Discount Logic ---
discounts <- ifelse(daily_amounts > 800, 0.15,
ifelse(daily_amounts > 500, 0.10, 0))
# Create temporary dataframe
temp_df <- data.frame(
sales_id = paste("Agent", s_id),
day = 1:days,
sales_amount = daily_amounts,
discount_rate = discounts,
net_sales = round(daily_amounts * (1 - discounts), 2)
)
# Calculate Cumulative Sales using the nested function
temp_df$cumulative_sales <- calculate_cumulative(temp_df$net_sales)
full_dataset <- rbind(full_dataset, temp_df)
}
return(full_dataset)
}
# --- Execution ---
set.seed(123)
simulation_data <- simulate_sales(n_salesperson = 5, days = 14)
# --- Summary Statistics ---
summary_stats <- simulation_data %>%
group_by(sales_id) %>%
summarise(
Total_Revenue = sum(net_sales),
Avg_Daily_Sale = round(mean(net_sales), 2),
Max_Single_Day = max(net_sales)
)
# Output summary to console
print(summary_stats)
## # A tibble: 5 × 4
## sales_id Total_Revenue Avg_Daily_Sale Max_Single_Day
## <chr> <dbl> <dbl> <dbl>
## 1 Agent 1 7970. 569. 817.
## 2 Agent 2 7894. 564. 846.
## 3 Agent 3 6527. 466. 822.
## 4 Agent 4 5600. 400. 741.
## 5 Agent 5 7490. 535. 770.
| Sales Agent | Total Revenue ($) | Avg Daily Sale ($) | Max Single Day ($) |
|---|---|---|---|
| Agent 1 | 7,970 | 569 | 817 |
| Agent 2 | 7,894 | 564 | 846 |
| Agent 3 | 6,527 | 466 | 822 |
| Agent 4 | 5,600 | 400 | 741 |
| Agent 5 | 7,490 | 535 | 770 |
Visualization
# --- Cumulative Sales Growth Plot ---
highchart() %>%
hc_chart(type = "line", backgroundColor = "#F9F9F9") %>%
hc_title(text = "<b>Salesperson Performance: Cumulative Growth</b>",
style = list(fontFamily = "Inter", fontSize = "22px")) %>%
hc_subtitle(text = "Tracking Net Sales over 14 Days (After Discounts)") %>%
hc_xAxis(title = list(text = "Day Number")) %>%
hc_yAxis(title = list(text = "Total Cumulative Sales ($)"), gridLineDashStyle = "Dot") %>%
hc_colors(c("#16a085", "#2980b9", "#8e44ad", "#f39c12", "#c0392b")) %>%
hc_add_series(simulation_data, "line", hcaes(x = day, y = cumulative_sales, group = sales_id)) %>%
hc_plotOptions(series = list(marker = list(enabled = TRUE, radius = 4), lineWidth = 3)) %>%
hc_tooltip(shared = TRUE, crosshairs = TRUE) %>%
hc_legend(align = "center", verticalAlign = "bottom", layout = "horizontal")
Interpretation
The simulation successfully demonstrates how a nested function architecture allows for localized data processing within a larger loop, making the code both modular and readable. By analyzing the Cumulative Trajectory, we can identify which agents maintain consistent performance versus those who rely on volatile high-value spikes. The Conditional Discounting logic effectively simulates a tiered business incentive, where higher sales volume is rewarded with a lower net price for the customer while tracking the final revenue generated for the organization.
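To make the tiered discount concrete, here is a small worked sketch in base R. The helper `discount_rate` is hypothetical (the original code uses a nested ifelse), and the three daily amounts are illustrative values rather than simulation output.

```r
# Hypothetical helper: maps sale amounts to the discount tiers used in
# simulate_sales (> 800 gives 15%, > 500 gives 10%, otherwise 0%).
discount_rate <- function(amount) {
  rates <- c(0, 0.10, 0.15)
  # left.open = TRUE makes the intervals (500, 800] and (800, Inf),
  # matching the strict ">" comparisons in the original ifelse() chain
  rates[findInterval(amount, c(500, 800), left.open = TRUE) + 1]
}

# Three illustrative daily amounts (not taken from the simulation output)
daily_amounts <- c(400, 600, 900)
rates <- discount_rate(daily_amounts)
net_sales <- daily_amounts * (1 - rates)

print(data.frame(
  sales_amount     = daily_amounts,
  discount_rate    = rates,
  net_sales        = net_sales,
  cumulative_sales = cumsum(net_sales)
))
```

With these inputs the net sales are 400, 540, and 765, and the cumulative series is 400, 940, and 1705. A lookup of this kind keeps the thresholds in one place, which may be easier to maintain than nested ifelse calls if more tiers are added.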
3. Multi-Level Performance Categorization
The function categorize_performance loops through a numeric vector element by element and assigns each value a label based on predefined financial thresholds.
Implementation & Visualization
#' @title Performance Categorization Function
#' @description Categorizes sales and calculates percentage distribution.
categorize_performance <- function(sales_amounts) {
categories <- character(length(sales_amounts))
# 1. Loop through vector to categorize
for (i in seq_along(sales_amounts)) {
val <- sales_amounts[i]
categories[i] <- ifelse(val >= 800, "Excellent",
ifelse(val >= 600, "Very Good",
ifelse(val >= 400, "Good",
ifelse(val >= 200, "Average", "Poor"))))
}
# 2. Calculate distribution
summary_df <- data.frame(Category = categories) %>%
group_by(Category) %>%
summarise(Count = n()) %>%
mutate(Percentage = round((Count / sum(Count)) * 100, 2))
return(summary_df)
}
# Execution with Sample Data
set.seed(456)
sample_sales <- runif(50, 50, 1000)
perf_summary <- categorize_performance(sample_sales)
# Visual 1: Bar Chart (Frequency)
hchart(perf_summary, "column", hcaes(x = Category, y = Count), name = "Staff Count") %>%
hc_title(text = "<b>Performance Distribution (Bar)</b>") %>%
hc_colors("#2ecc71")
Interpretation
The distribution analysis provides an immediate health check of sales operations. The Bar Plot identifies the most common performance tier (the mode), while the Pie Chart highlights whether the organization is “top-heavy” (mostly Excellent) or struggling (mostly Poor). This logic is essential for HR management and setting future sales targets.
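The same thresholds can also be applied without an explicit loop. The sketch below is one possible vectorized variant using base R's cut(); the function name `categorize_with_cut` and the example values are assumptions made for illustration.

```r
# Hypothetical vectorized variant of the categorization step.
# right = FALSE makes every interval closed on the left, e.g. [800, Inf),
# which matches the ">=" comparisons in categorize_performance.
categorize_with_cut <- function(sales_amounts) {
  cut(
    sales_amounts,
    breaks = c(-Inf, 200, 400, 600, 800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE
  )
}

# Illustrative values chosen to sit on or near each threshold
example_sales <- c(150, 200, 399.99, 400, 650, 800, 950)
tiers <- categorize_with_cut(example_sales)
print(as.character(tiers))

# Count and percentage per tier, kept in threshold order by the factor levels
counts <- table(tiers)
print(round(100 * prop.table(counts), 2))
```

Since cut() returns an ordered set of factor levels, the resulting table lists the tiers from Poor to Excellent instead of alphabetically, which tends to read better in a bar chart.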
4. Multi-Company Dataset Simulation
The function generate_company_data is designed to simulate large-scale organizational data. It utilizes nested loops: the outer loop handles the creation of companies, while the inner loop generates randomized metrics for each individual employee.
Implementation
# --- Libraries ---
library(dplyr)
library(highcharter)
library(purrr)
library(tidyr)
# --- Simulation Function ---
# Generates a synthetic dataset for multiple companies and their employees
generate_company_data <- function(n_company, n_employees) {
full_dataset <- data.frame()
departments <- c("Data Science", "Engineering", "Marketing", "Finance", "Operations")
# Outer Loop: Iterate per Company
for (c_id in 1:n_company) {
company_name <- paste("Company", LETTERS[c_id])
# Inner Loop: Iterate per Employee
for (e_id in 1:n_employees) {
# Metrics Generation
salary <- round(rnorm(1, mean = 5000, sd = 1200), 2)
perf_score <- round(runif(1, min = 50, max = 100), 1)
# Conditional Logic: High performers (Score > 85) get higher KPI distributions
if (perf_score > 85) {
kpi_base <- rnorm(1, mean = 92, sd = 4)
} else {
kpi_base <- rnorm(1, mean = 75, sd = 10)
}
kpi_score <- round(min(kpi_base, 100), 1) # Cap at 100
# KPI Threshold Labeling
status <- ifelse(kpi_score > 90, "Top Performer", "Standard")
# Row Construction
row <- data.frame(
company_id = company_name,
employee_id = paste0("EMP-", c_id, "-", sprintf("%03d", e_id)),
department = sample(departments, 1),
salary = salary,
performance_score = perf_score,
kpi_score = kpi_score,
performance_tier = status
)
full_dataset <- rbind(full_dataset, row)
}
}
return(full_dataset)
}
# --- Execute Simulation ---
set.seed(42)
raw_data <- generate_company_data(n_company = 5, n_employees = 50)
# --- Create Summary Table ---
# Aggregating metrics to interpret company-level trends
company_summary <- raw_data %>%
group_by(company_id) %>%
summarise(
avg_salary = round(mean(salary), 2),
avg_performance = round(mean(performance_score), 2),
max_kpi = max(kpi_score),
top_performers = sum(performance_tier == "Top Performer")
)
# Display the summary table
print(company_summary)
## # A tibble: 5 × 5
## company_id avg_salary avg_performance max_kpi top_performers
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Company A 4934. 73.0 98.5 10
## 2 Company B 4830. 75.6 95.9 13
## 3 Company C 4689. 73.4 100 12
## 4 Company D 4945. 77.2 97.2 16
## 5 Company E 5027. 73.1 100 16
| Company ID | Avg Salary ($) | Avg Performance | Max KPI | Top Performers |
|---|---|---|---|---|
| Company A | 4,934 | 73.0 | 98.5 | 10 |
| Company B | 4,830 | 75.6 | 95.9 | 13 |
| Company C | 4,689 | 73.4 | 100.0 | 12 |
| Company D | 4,945 | 77.2 | 97.2 | 16 |
| Company E | 5,027 | 73.1 | 100.0 | 16 |
Visualization
highchart() %>%
hc_chart(type = "column", backgroundColor = "#FCFCFC") %>%
hc_title(text = "<b>Company Performance & Salary Insights</b>",
style = list(color = "#2c3e50", useHTML = TRUE)) %>%
hc_subtitle(text = "Aggregated simulation of salary vs performance metrics") %>%
hc_xAxis(categories = company_summary$company_id) %>%
hc_yAxis_multiples(
list(title = list(text = "Average Salary ($)"), opposite = FALSE),
list(title = list(text = "Avg Performance Score"), opposite = TRUE)
) %>%
hc_add_series(name = "Avg Salary", data = company_summary$avg_salary, yAxis = 0, color = "#1abc9c") %>%
hc_add_series(name = "Avg Performance", data = company_summary$avg_performance, yAxis = 1, color = "#3498db") %>%
hc_tooltip(shared = TRUE, crosshairs = TRUE) %>%
hc_add_theme(hc_theme_smpl())
raw_data %>%
hchart("scatter", hcaes(x = performance_score, y = kpi_score, group = company_id)) %>%
hc_title(text = "<b>Performance vs. KPI Correlation</b>") %>%
hc_subtitle(text = "Distribution of employees across all simulated companies") %>%
hc_colors(c("#1abc9c", "#3498db", "#9b59b6", "#e67e22", "#e74c3c")) %>%
hc_xAxis(title = list(text = "General Performance Score")) %>%
hc_yAxis(title = list(text = "Specific KPI Score")) %>%
hc_plotOptions(scatter = list(marker = list(radius = 4, symbol = "circle")))
Interpretation
The simulation generated a dataset of 5 companies with 50 employees each (250 in total), using nested loops to maintain a clear organizational hierarchy. Conditional logic creates a positive relationship between performance and results: employees with a performance score above 85 draw their KPI from a higher, narrower distribution (mean 92, standard deviation 4) and are therefore far more likely to achieve a KPI score greater than 90 than the remaining employees (mean 75, standard deviation 10). Salaries are drawn from a normal distribution (\(\mu=5000, \sigma=1200\)), providing a realistic variance in compensation levels across the simulated workforce.
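How much more likely a high performer is to exceed a KPI of 90 can be worked out from the two normal distributions in the generator. The sketch below does this with pnorm; it ignores the rounding to one decimal place, and all variable names are introduced here for illustration.

```r
# Theoretical P(KPI > 90) for each branch of the conditional logic.
# Rounding to one decimal is ignored in this sketch; the cap at 100
# does not affect the probability of exceeding 90.
p_top_high     <- 1 - pnorm(90, mean = 92, sd = 4)   # performance_score > 85
p_top_standard <- 1 - pnorm(90, mean = 75, sd = 10)  # performance_score <= 85

# performance_score is uniform on [50, 100], so P(score > 85) = 15 / 50
p_high_performer <- (100 - 85) / (100 - 50)

# Law of total probability, then the expected count among 50 employees
p_top_overall <- p_high_performer * p_top_high +
  (1 - p_high_performer) * p_top_standard
expected_top_per_company <- 50 * p_top_overall

print(round(c(high = p_top_high, standard = p_top_standard,
              overall = p_top_overall), 4))
print(round(expected_top_per_company, 1))
```

Under these assumptions a high performer exceeds 90 with a probability of roughly 0.69, compared with about 0.07 otherwise, and the expected number of top performers is close to 12.7 per company. The counts of 10 to 16 in the summary table are in line with that figure.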
5. Monte Carlo Simulation: Pi & Probability
This script defines the monte_carlo_pi function to estimate the value of \(\pi\) and analyze the spatial probability of random points. It generates all coordinate points in one vectorized call and applies logical conditions to classify them and summarize the simulation results.
Implementation
# --- Libraries ---
library(dplyr)
library(highcharter)
# --- Monte Carlo Pi Function ---
monte_carlo_pi <- function(n_points) {
# Generate random coordinates in a 1x1 square [0,1]
set.seed(123)
x <- runif(n_points, min = 0, max = 1)
y <- runif(n_points, min = 0, max = 1)
# Calculate distance from origin (0,0) to check if inside the quarter circle
distance_squared <- x^2 + y^2
is_inside <- distance_squared <= 1
# Probability analysis: Points falling in a specific sub-square (e.g., 0 < x,y < 0.5)
is_in_sub_square <- (x <= 0.5 & y <= 0.5)
prob_sub_square <- sum(is_in_sub_square) / n_points
# Estimate Pi: (Points in Circle / Total Points) * 4
pi_estimate <- (sum(is_inside) / n_points) * 4
# Store results in a dataframe for visualization
sim_data <- data.frame(
x = x,
y = y,
status = ifelse(is_inside, "Inside Circle", "Outside Circle"),
in_sub_square = is_in_sub_square
)
return(list(data = sim_data, pi = pi_estimate, prob_sub = prob_sub_square))
}
# --- Execute Simulation ---
# Using 5,000 points for a balance of precision and visualization performance
results <- monte_carlo_pi(n_points = 5000)
# --- Summary Statistics ---
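cat("Estimated Pi Value:", results$pi, "\n")
cat("Actual Pi Value:", pi, "\n")
cat("Error Margin:", abs(pi - results$pi), "\n")
cat("Prob. in Sub-Square (0.5x0.5):", results$prob_sub * 100, "%\n")
## Estimated Pi Value: 3.1816
## Actual Pi Value: 3.141593
## Error Margin: 0.04000735
## Prob. in Sub-Square (0.5x0.5): 25.06 %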
Visualization
# --- Visualization (Highcharter) ---
highchart() %>%
hc_chart(type = "scatter", zoomType = "xy", backgroundColor = "#F9F9F9") %>%
hc_title(text = "<b>Monte Carlo Simulation: Estimating Pi</b>",
style = list(color = "#2c3e50", useHTML = TRUE)) %>%
hc_subtitle(text = paste("Total Points:", nrow(results$data), "| Estimated Pi:", results$pi)) %>%
hc_xAxis(title = list(text = "X Coordinate"), min = 0, max = 1) %>%
hc_yAxis(title = list(text = "Y Coordinate"), min = 0, max = 1, height = "100%", width = "100%") %>%
hc_add_series(results$data, "scatter", hcaes(x = x, y = y, group = status),
marker = list(radius = 2)) %>%
hc_colors(c("#3498db", "#e74c3c")) %>% # Blue for Inside, Red for Outside
hc_tooltip(pointFormat = "X: {point.x:.3f}<br>Y: {point.y:.3f}<br>{point.status}") %>%
hc_plotOptions(scatter = list(states = list(hover = list(enabled = TRUE))))
Interpretation
The simulation estimates \(\pi\) from the ratio between the area of a circle (\(A = \pi r^2\)) and its circumscribed square (\(A = (2r)^2\)), which equals \(\frac{\pi}{4}\). The code samples only the unit square \([0, 1]^2\) and the quarter circle inside it, which has the same ratio, so the proportion of random points falling inside the circle approximates \(\frac{\pi}{4}\) and multiplying by 4 gives the estimate (3.1816 with 5,000 points, an error of about 0.04). The sub-square analysis supports this: the observed share of 25.06% is close to the geometric probability of 25% (\(0.5 \times 0.5\)), which is consistent with uniformly distributed random numbers. As the number of points (\(n\)) increases, the estimate tends to converge toward the true constant, illustrating the Law of Large Numbers; the typical error shrinks in proportion to \(1/\sqrt{n}\), so each additional digit of precision requires about 100 times more points.
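The convergence behavior can be illustrated by repeating the estimate with growing sample sizes. The following is a rough sketch in base R; `estimate_pi` and `pi_standard_error` are hypothetical helpers, and unlike monte_carlo_pi the seed is set outside the function so that each call draws fresh points.

```r
# Hypothetical helper: same quarter-circle estimator as monte_carlo_pi,
# but without a seed inside the function so that repeated calls differ.
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  4 * mean(x^2 + y^2 <= 1)
}

# Theoretical standard error of the estimate: 4 * sqrt(p * (1 - p) / n),
# where p = pi / 4 is the probability that a point lands inside the circle
pi_standard_error <- function(n_points) {
  p <- pi / 4
  4 * sqrt(p * (1 - p) / n_points)
}

set.seed(123)  # seed chosen arbitrarily for reproducibility
sample_sizes <- c(100, 1000, 10000, 100000, 1000000)
estimates <- sapply(sample_sizes, estimate_pi)

print(data.frame(
  n_points       = sample_sizes,
  pi_estimate    = estimates,
  abs_error      = abs(estimates - pi),
  standard_error = pi_standard_error(sample_sizes)
))
```

For 5,000 points the theoretical standard error is about 0.023, so the error margin of 0.04 reported in the output above lies within two standard errors. Individual runs will not shrink monotonically, but the standard error column shows the expected \(1/\sqrt{n}\) trend.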
6. Advanced Data Transformation & Feature Engineering
This script defines the normalize_columns and z_score functions to standardize datasets and generate new analytical features. It utilizes loop-based transformations to process multiple data columns efficiently and applies conditional logic to categorize data.
Implementation
# Load necessary library for interactive charts
# install.packages("highcharter")
library(highcharter)
library(magrittr)
# Seed for reproducibility
set.seed(2026)
# Generate synthetic dataset
df <- data.frame(
employee_id = 1:100,
salary = rnorm(100, mean = 5500, sd = 1200),
performance_score = runif(100, min = 10, max = 95)
)
normalize_columns <- function(df) {
# Identify numeric columns (excluding ID)
cols_to_fix <- names(df)[sapply(df, is.numeric) & names(df) != "employee_id"]
for (col in cols_to_fix) {
min_val <- min(df[[col]], na.rm = TRUE)
max_val <- max(df[[col]], na.rm = TRUE)
df[[paste0(col, "_normalized")]] <- (df[[col]] - min_val) / (max_val - min_val)
}
return(df)
}
z_score <- function(df) {
cols_to_fix <- names(df)[sapply(df, is.numeric) & !grepl("_normalized|employee_id", names(df))]
for (col in cols_to_fix) {
mu <- mean(df[[col]], na.rm = TRUE)
sigma <- sd(df[[col]], na.rm = TRUE)
df[[paste0(col, "_zscore")]] <- (df[[col]] - mu) / sigma
}
return(df)
}
# Apply transformations
df <- normalize_columns(df)
df <- z_score(df)
# Feature Creation: Performance Categories and Salary Brackets
df$performance_category <- ifelse(df$performance_score >= 80, "High Performer",
ifelse(df$performance_score >= 50, "Average", "Below Average"))
df$salary_bracket <- cut(df$salary,
breaks = quantile(df$salary, probs = c(0, 0.33, 0.66, 1)),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE)
Visualization
# 1. Prepare Histogram Data for Highcharts
# We calculate the bins for both Raw and Z-Score data
h_raw <- hist(df$salary, plot = FALSE, breaks = 15)
h_z <- hist(df$salary_zscore, plot = FALSE, breaks = 15)
# 2. Generate the High-End Interactive Histogram
highchart() %>%
hc_chart(backgroundColor = "#FAFAFA", zoomType = "xy") %>%
hc_title(text = "<b>Distribution Analysis: Raw vs. Standardized</b>",
style = list(fontSize = "22px", color = "#2c3e50", fontFamily = "Helvetica")) %>%
hc_subtitle(text = "Interactive Histogram showing Frequency Bins") %>%
# Dual X-Axes for comparison
hc_xAxis_multiples(
list(title = list(text = "Raw Salary Values"), col = "#3498db", opposite = FALSE),
list(title = list(text = "Z-Score Values"), col = "#e74c3c", opposite = TRUE)
) %>%
hc_yAxis(title = list(text = "Frequency (Count)")) %>%
# Adding Raw Salary Series (using the first X-axis)
hc_add_series(name = "Raw Salary",
data = h_raw$counts,
type = "column",
xAxis = 0,
color = "#3498db",
pointPadding = 0,
groupPadding = 0.1) %>%
# Adding Z-Score Series (using the second X-axis)
hc_add_series(name = "Z-Score Salary",
data = h_z$counts,
type = "column",
xAxis = 1,
color = "#e74c3c",
pointPadding = 0,
groupPadding = 0.1) %>%
hc_plotOptions(column = list(
borderRadius = 2,
borderWidth = 0.5,
borderColor = "#FFFFFF",
tooltip = list(pointFormat = "<b>Frequency:</b> {point.y}")
)) %>%
hc_tooltip(shared = FALSE, crosshairs = TRUE) %>%
hc_legend(enabled = TRUE) %>%
hc_exporting(enabled = TRUE)
hcboxplot(x = df$performance_score, var = df$salary_bracket, outliers = TRUE, color = "#1abc9c") %>%
hc_chart(type = "column", backgroundColor = "#FFFFFF") %>%
hc_title(text = "<b>Performance Distribution by Salary Bracket</b>") %>%
hc_subtitle(text = "Engineered categorical feature comparison") %>%
hc_xAxis(title = list(text = "Salary Bracket (Engineered)")) %>%
hc_yAxis(title = list(text = "Performance Score")) %>%
hc_plotOptions(series = list(animation = list(duration = 1500)))
## Warning in hcboxplot(x = df$performance_score, var = df$salary_bracket, : 'hcboxplot' is deprecated.
## Use 'data_to_boxplot' instead.
## See help("Deprecated")
Interpretation
The data transformation process used Min-Max Normalization to rescale features into a fixed \([0, 1]\) range, which matters for algorithms sensitive to variable magnitude, while Z-score Standardization centered the data on a mean of zero (\(\mu = 0\)) with unit standard deviation (\(\sigma = 1\)) to make statistical outliers easier to identify. Through feature engineering, continuous numeric values were converted into meaningful business segments, specifically performance_category and salary_bracket, enabling categorical comparisons such as whether “High Performer” employees consistently fall into the “High” salary bracket. The histogram comparison shows that the underlying shape of the distribution remains intact while the scale moves to a standard format, allowing a direct and balanced comparison between disparate units, such as salary in dollars versus performance in percentages.
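A tiny worked example may help to show what the two transformations do to the same values. This is a sketch under stated assumptions: `min_max` and `z_standardize` are hypothetical single-vector helpers, the input values are made up, and the handling of a constant column (returning zeros) is a design choice of this sketch, not something the original functions do.

```r
# Hypothetical helpers for a single numeric vector
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Assumption for this sketch: a constant column maps to 0 rather than
  # dividing by zero
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

z_standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Illustrative values, not taken from the simulated dataset
salary <- c(2000, 4000, 6000, 8000, 10000)

normalized   <- min_max(salary)
standardized <- z_standardize(salary)

print(normalized)
print(round(standardized, 3))
print(c(mean = mean(standardized), sd = sd(standardized)))
print(min_max(c(5, 5, 5)))
```

The normalized values run evenly from 0 to 1, while the standardized values are symmetric around 0 with a standard deviation of 1. The guard for constant columns is worth considering because normalize_columns divides by max minus min, which is zero when every value in a column is identical.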
7. Mini Project: Company KPI Dashboard & Simulation
This script generates a comprehensive dataset for multiple companies (5-10) and their employees (50-200 each) to build an automated performance dashboard. It utilizes nested loops and conditional logic to simulate complex organizational structures and performance metrics.
Implementation
library(reactable)
# --- Data Generation Function ---
generate_mini_project_data <- function(n_companies = 7, min_emp = 50, max_emp = 200) {
full_data <- data.frame()
depts <- c("Engineering", "Data Science", "Sales", "HR", "Product")
for (i in 1:n_companies) {
comp_name <- paste("Company", LETTERS[i])
n_emp <- sample(min_emp:max_emp, 1)
for (j in 1:n_emp) {
# Random generation with normal and uniform distributions
salary <- round(rnorm(1, mean = 6000, sd = 1500), 2)
perf_score <- round(runif(1, 40, 100), 1)
kpi_score <- round(pmin(100, perf_score * rnorm(1, 1.02, 0.08)), 1)
row <- data.frame(
employee_id = paste0("EMP-", i, "-", j),
company_id = comp_name,
department = sample(depts, 1),
salary = salary,
performance_score = perf_score,
KPI_score = kpi_score
)
full_data <- rbind(full_data, row)
}
}
return(full_data)
}
set.seed(2026)
master_df <- generate_mini_project_data()
# --- Categorization Loop ---
master_df$KPI_tier <- NA
for (i in seq_len(nrow(master_df))) {
score <- master_df$KPI_score[i]
if (score >= 90) {
master_df$KPI_tier[i] <- "Elite"
} else if (score >= 75) {
master_df$KPI_tier[i] <- "High Achiever"
} else if (score >= 55) {
master_df$KPI_tier[i] <- "Core"
} else {
master_df$KPI_tier[i] <- "Underperforming"
}
}
# --- Summary Table Calculation ---
company_summary <- master_df %>%
group_by(company_id) %>%
summarise(
avg_salary = round(mean(salary), 2),
avg_KPI = round(mean(KPI_score), 2),
top_performers_count = sum(KPI_tier == "Elite"),
total_staff = n()
)
# Outputting the Table
reactable(
company_summary,
bordered = TRUE, striped = TRUE, highlight = TRUE,
columns = list(
avg_salary = colDef(name = "Avg Salary ($)", format = colFormat(separators = TRUE)),
avg_KPI = colDef(name = "Avg KPI Score"),
top_performers_count = colDef(name = "Elite Count", style = list(color = "#2ecc71", fontWeight = "bold")),
total_staff = colDef(name = "Total Employees")
)
)
Visualization
# Summarizing KPI by Department
dept_data <- master_df %>%
group_by(department) %>%
summarise(mean_kpi = mean(KPI_score))
hchart(dept_data, "column", hcaes(x = department, y = mean_kpi), name = "Average KPI") %>%
hc_title(text = "<b>KPI Performance by Department</b>") %>%
hc_colors("#3498db") %>%
hc_add_theme(hc_theme_smpl())
# Creating Scatter Plot with Linear Regression Trend
hchart(master_df, "scatter", hcaes(x = performance_score, y = salary, group = company_id)) %>%
hc_title(text = "<b>Salary Distribution vs. Performance Score</b>") %>%
hc_subtitle(text = "Analysis of compensation fairness across corporate entities") %>%
hc_xAxis(title = list(text = "Performance Rating")) %>%
hc_yAxis(title = list(text = "Salary ($)")) %>%
hc_plotOptions(scatter = list(marker = list(radius = 3))) %>%
hc_add_series(master_df, "line", hcaes(x = performance_score, y = predict(lm(salary ~ performance_score, data = master_df))),
name = "Market Trend", color = "black", dashStyle = "Dash")
Interpretation
The automated dataset generation simulated an organizational structure of 7 companies with varying workforce sizes (50 to 200 employees each), so the data reflects real-world variability. Through loop-based categorization, the workforce was segmented into four distinct KPI tiers: Elite, High Achiever, Core, and Underperforming. The dashboard charts allow KPI averages to be compared across departments, although department is assigned at random, so any departmental differences reflect sampling variation. Likewise, salary is drawn independently of performance_score in the generator, so the slope of the regression trendline in the scatter plot should be read as sampling noise rather than as evidence of a built-in link between performance and compensation; only KPI_score is constructed to depend on performance_score. Overall, the project demonstrates how stochastic modeling can be used to prototype dashboards that inform corporate human capital strategies.
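How the trendline should be read depends on how the variables were generated. The sketch below draws the same distributions as generate_mini_project_data in vectorized form and compares two correlations; the sample size of 10,000 and the seed are assumptions chosen only to make the pattern easy to see.

```r
set.seed(2026)  # arbitrary seed for this sketch
n <- 10000      # larger than the simulated workforce, to reduce noise

# Same distributions as generate_mini_project_data, drawn in vectorized form
salary     <- rnorm(n, mean = 6000, sd = 1500)
perf_score <- runif(n, min = 40, max = 100)
kpi_score  <- pmin(100, perf_score * rnorm(n, mean = 1.02, sd = 0.08))

# salary never uses perf_score, so this correlation should be close to 0
cor_salary_perf <- cor(perf_score, salary)

# kpi_score is a noisy multiple of perf_score, so this should be strongly positive
cor_kpi_perf <- cor(perf_score, kpi_score)

print(round(c(salary_vs_performance = cor_salary_perf,
              kpi_vs_performance    = cor_kpi_perf), 3))
```

If a relationship between pay and performance is wanted in the simulation, it would have to be built into the salary formula, for example by adding a term that depends on perf_score. As written, only the KPI score carries that dependence.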