Advanced Practicum: Functions, Loops, and Data Science Simulation
Practicum Week 5
1 Introduction
This report is prepared as part of the assignment for the Data Science Programming course. The main focus of this practical session is to integrate fundamental programming concepts, such as functions, nested loops, and conditional logic, into a more complex Data Science workflow.
The objective of this practical work is to simulate real-world scenarios where a Data Scientist often needs to create automated tools to process large-scale data. This report contains eight main implementation points that will be discussed, including:
Multi-Formula Functions: We will build a dynamic function capable of calculating various mathematical models, from linear to exponential models, all within a single iterative process.
Sales & Discount Simulation: By applying nested loops, we will simulate daily transaction data for several sales representatives and implement automatic discount applications.
Performance Categorization: We will categorize data into five performance levels based on specific statistical criteria, making it easier to understand varying quality levels of performance.
Multi-Company Simulation: We will generate synthetic datasets consisting of complex data for multiple companies simultaneously, including information on salaries, departments, and KPI scores.
Monte Carlo Simulation: We will use a stochastic approach to estimate the value of the mathematical constant π by distributing random points.
Data Transformation: The process of data normalization and feature engineering will be conducted to prepare data before moving on to deeper analytical stages.
KPI Dashboard: We will create more sophisticated visualizations, such as regression graphs and grouped bar charts, to gain deeper insights from company data.
Automated Reporting: Lastly, we will develop a function that automatically generates individual reports for each company entity in separate file formats.
Through this simulation approach, we aim to improve our understanding of data structures and code efficiency while producing documentation that is clear and professional. We hope that completing this practical work will leave us better prepared for real-world challenges in the field of Data Science.
2 Task 1 - Dynamic Multi-Formula Function
Linear: (y = 2x + 1) This function is linear, indicating a direct relationship between (x) and (y). The coefficient 2 means that for every unit increase in (x), (y) increases by 2. The constant 1 indicates that when (x = 0), (y) equals 1. Output: Results show that (y) rises steadily with increasing (x), forming a straight line on a graph.
Quadratic: (y = x^2 + 2x + 1) This is a quadratic function that forms a parabola. It reaches its minimum at (x = -1) and is symmetrical. The coefficient of (x^2) (which is 1) indicates that the parabola opens upwards. Output: As (x) moves away from the minimum, either increasing or decreasing, (y) rises rapidly, showing the characteristic of a parabola.
Cubic: (y = x^3 - 3x^2 + 3x - 1) This cubic function factors as ((x - 1)^3). Like every cubic, it has exactly one inflection point, here at (x = 1), where the curve flattens momentarily (the slope is zero) without turning. It has no local maximum or minimum, so it increases over its whole domain. Output: (y) equals 0 at (x = 1) and rises ever more steeply as (x) grows, overtaking the quadratic for larger (x).
Exponential: (y = e^x) This exponential function represents extremely rapid growth. It’s commonly applied in contexts like population growth and finance. Output: The output indicates that (y) rises quickly with increasing (x), becoming very large as (x) grows.
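The four formulas above can be wrapped in a single dynamic function and evaluated inside a nested loop. The sketch below is a minimal illustration in base R; the helper name `compute_formula` and the small range `x = 1..5` are our own assumptions and are not part of the plotting listing.

```r
# Hypothetical helper: evaluate one of the four Task 1 formulas by name.
compute_formula <- function(x, type) {
  switch(type,
    linear      = 2 * x + 1,
    quadratic   = x^2 + 2 * x + 1,
    cubic       = x^3 - 3 * x^2 + 3 * x - 1,
    exponential = exp(x),
    stop("Unknown formula type: ", type)  # default branch for unrecognized names
  )
}

formula_types <- c("linear", "quadratic", "cubic", "exponential")
x_values <- 1:5  # assumption: a short range keeps the printed table small

# Nested loop: the outer loop walks over x, the inner loop over formula types
results <- data.frame()
for (x in x_values) {
  for (type in formula_types) {
    results <- rbind(
      results,
      data.frame(x = x, formula = type, value = compute_formula(x, type))
    )
  }
}

print(head(results, 8))
```

Growing a data frame with `rbind()` inside a loop is acceptable for a table this small; for long ranges a vectorized approach scales better.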
# ==========================================================================
# MATHEMATICAL MODELING: GROWTH RATE COMPARISON
# Focus: Linear, Polynomial, and Exponential Complexity
# ==========================================================================
# 1. LOAD ESSENTIAL LIBRARIES ==============================================
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
# 2. DATA GENERATION (Optimized Vectorization) =============================
# Using a 0.5 step for a high-definition smooth curve
x_values <- seq(1, 20, by = 0.5)
math_data <- data.frame(x = x_values) %>%
mutate(
Linear = 2 * x + 1,
Quadratic = x^2 + 2 * x + 1,
Cubic = x^3 - 3 * x^2 + 3 * x - 1,
Exponential = exp(x)
) %>%
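# Reshaping from Wide to Long format for GGPlot compatibility
pivot_longer(cols = -x, names_to = "formula", values_to = "value") %>%
# Cubic is exactly 0 at x = 1; a log10 axis cannot display non-positive values
filter(value > 0)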
# Ordering factors for a logical legend (Smallest to Largest growth)
math_data$formula <- factor(math_data$formula,
levels = c("Linear", "Quadratic", "Cubic", "Exponential"))
# 3. DARK ANALYTICS THEME ==========================================
modern_dark_theme <- theme_minimal(base_size = 14) +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid.major = element_line(color = "#2D333F", linewidth = 0.2),
panel.grid.minor = element_blank(),
axis.title = element_text(face = "bold", color = "#FFFFFF"),
axis.text = element_text(color = "#A0A0A0"),
plot.title = element_text(face = "bold", size = 20, color = "#FFFFFF", hjust = 0.5),
plot.subtitle = element_text(size = 12, color = "#38BDF8", hjust = 0.5),
legend.position = "bottom",
legend.background = element_rect(fill = "#151921", color = NA),
legend.text = element_text(size = 11)
)
# 4. VISUALIZATION ENGINE ==================================================
math_plot <- ggplot(math_data, aes(x = x, y = value, color = formula)) +
# Continuous lines with distinct dash patterns for accessibility
geom_line(aes(linetype = formula), linewidth = 1.2, alpha = 0.85) +
# Adding markers at specific intervals to avoid cluttering the UI
geom_point(data = filter(math_data, x %% 2 == 0), size = 2.5) +
# CRITICAL: Logarithmic Y-axis to visualize exponential vs linear growth
scale_y_log10(
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))
) +
# Custom Neon Palette (Green, Orange, Pink, Cyan)
scale_color_manual(values = c("#50FA7B", "#FFB86C", "#FF79C6", "#8BE9FD")) +
labs(
title = "Mathematical Function Growth Comparison",
subtitle = "Logarithmic Scale (Base 10) for Complexity Visualization",
x = "Input (x)",
y = "Output f(x) [Log Scale]",
color = "Function Type:",
linetype = "Function Type:"
) +
modern_dark_theme
# Execute Rendering
print(math_plot)
Interpretation:
This visualization compares the behavior of four mathematical functions (linear, quadratic, cubic, and exponential) for x values from 1 to 20:
1. Linear Function (y = 2x + 1): - Steady, constant growth. - Straight line with a fixed slope.
2. Quadratic Function (y = x² + 2x + 1): - Increasing growth rate. - Parabolic curve.
3. Cubic Function (y = x³ - 3x² + 3x - 1): - Rapid growth. - Steeper curve than quadratic.
4. Exponential Function (y = eˣ): - Extremely rapid growth. - Sharp increase after x > 5.
Conclusion:
The visualization highlights how different functions grow at different rates. The linear function grows steadily, the polynomial functions (quadratic, cubic) accelerate, and the exponential function eventually outgrows all of them. Understanding these behaviors is essential for selecting appropriate models in fields like computer science, economics, and the natural sciences.
3 Task 2 - Nested Simulation: Multi-Sales & Discounts
In this assignment, we will create a sales simulation involving multiple salespeople and working days. The simulation generates sales data consisting of a sales ID, the day of the sale, the sales amount, and the applicable discount rate. Further processing then calculates total sales per salesperson and applies conditional discounts based on each sales amount. We will also visualize the total sales per salesperson.
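Before the vectorized listing, the nested-loop idea can be sketched explicitly in base R. The block below is an illustration under our own assumptions: three salespeople and five days to keep the output short, and the column names `sales_id`, `net`, and `cumulative_net` are hypothetical. The discount tiers reuse the 2300 and 1300 thresholds from the `case_when()` call in the listing.

```r
set.seed(2026)  # any seed works; this one matches the report's listing

n_staff <- 3  # assumption: small sizes for a readable example
n_days  <- 5

sales_log <- data.frame()
for (s in seq_len(n_staff)) {        # outer loop: one pass per salesperson
  cumulative <- 0
  for (d in seq_len(n_days)) {       # inner loop: one pass per working day
    # Draw a daily amount and floor it at 400, as in the listing
    amount <- round(max(rnorm(1, mean = 2000, sd = 550), 400), 2)

    # Conditional discount based on the sales amount
    if (amount > 2300) {
      discount <- 0.20
    } else if (amount > 1300) {
      discount <- 0.10
    } else {
      discount <- 0
    }

    net <- amount * (1 - discount)
    cumulative <- cumulative + net   # running total per salesperson

    sales_log <- rbind(
      sales_log,
      data.frame(sales_id = s, day = d, amount = amount,
                 discount = discount, net = net, cumulative_net = cumulative)
    )
  }
}

# Total net sales per salesperson
totals <- aggregate(net ~ sales_id, data = sales_log, FUN = sum)
print(totals)
```

The last `cumulative_net` value of each salesperson equals that person's entry in `totals`, which is a quick way to check the loop logic.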
# ==========================================================================
# SALES FORCE ANALYTICS: REVENUE & TREND TRACKER
# Focus: Vectorized Simulation and Temporal Trend Analysis
# ==========================================================================
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(scales)
simulate_sales_engine <- function(n_staff, days) {
set.seed(2026)
# 1. HIGH-SPEED VECTORIZED SIMULATION =====================================
# Generating the entire dataset instantly using expand.grid
sales_data <- expand.grid(
day = 1:days,
staff_member = paste("Agent", 1:n_staff)
) %>%
mutate(
transaction_id = row_number(),
# Modeling sales amount with a normal distribution for realistic variance
base_amount = round(rnorm(n(), mean = 2000, sd = 550) %>% pmax(400), 2),
# Dynamic discount logic based on volume
discount_pct = case_when(
base_amount > 2300 ~ 0.20,
base_amount > 1300 ~ 0.10,
TRUE ~ 0
),
net_revenue = base_amount * (1 - discount_pct)
)
# 2. KPI AGGREGATION ======================================================
performance_summary <- sales_data %>%
group_by(staff_member) %>%
summarise(
gross_total = sum(base_amount),
net_total = sum(net_revenue),
avg_ticket_size = mean(base_amount),
discount_leak = (gross_total - net_total) / gross_total,
.groups = 'drop'
) %>%
arrange(desc(net_total))
# 3. DARK ANALYTICS THEME =========================================
modern_theme <- theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid.major = element_line(color = "#2D333F", linewidth = 0.1),
panel.grid.minor = element_blank(),
axis.title = element_text(face = "bold", color = "#FFFFFF"),
axis.text = element_text(color = "#A0A0A0"),
plot.title = element_text(face = "bold", size = 16, color = "#FFFFFF"),
plot.subtitle = element_text(size = 10, color = "#38BDF8"),
legend.position = "none"
)
# 4. VIZ A: REVENUE RANKING (Horizontal Bar Chart) ========================
p1 <- ggplot(performance_summary, aes(x = reorder(staff_member, net_total), y = net_total, fill = staff_member)) +
geom_col(width = 0.7, alpha = 0.8) +
geom_text(aes(label = dollar(net_total, scale = 1/1000, suffix = "K")),
hjust = 1.2, color = "black", fontface = "bold", size = 4) +
scale_y_continuous(labels = label_dollar()) +
scale_fill_brewer(palette = "Set3") +
coord_flip() +
labs(title = "Total Net Revenue Ranking", subtitle = "Actual cash-in after discount deductions", x = NULL, y = "Total Revenue") +
modern_theme
# 5. VIZ B: TIME-SERIES TREND (Line Chart) ===============================
p2 <- ggplot(sales_data, aes(x = day, y = net_revenue, color = staff_member)) +
geom_line(linewidth = 1, alpha = 0.6) +
geom_point(size = 1.5, alpha = 0.8) +
scale_y_continuous(labels = label_dollar()) +
scale_color_brewer(palette = "Set3") +
# Build the title from the `days` argument so it stays correct for other periods
labs(title = paste0(days, "-Day Performance Pulse"), subtitle = "Tracking daily fluctuation and consistency", x = "Day of Month", y = "Daily Revenue") +
modern_theme
# 6. DASHBOARD ASSEMBLY ==================================================
final_dashboard <- (p1 / p2) +
plot_annotation(
title = "SALES FORCE ANALYTICS REPORT 2026",
theme = theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(color = "#FFFFFF", size = 22, face = "bold", hjust = 0.5)
)
)
print(final_dashboard)
return(list(raw_data = sales_data, summary = performance_summary))
}
# Execute for 6 agents over a 30-day period
results <- simulate_sales_engine(n_staff = 6, days = 30)
Interpretation
1. What the Charts Show: - A horizontal bar chart ranking the six agents (Agent 1 to Agent 6) by total net revenue after discounts over 30 days, with each total labeled in thousands of dollars inside its bar. - A line chart tracking each agent's daily net revenue across the 30 days.
2. Key Observations: - Total net revenue differs from agent to agent. - Because every agent draws from the same distribution, these differences come only from random daily fluctuation. - Top performers are easy to identify from the ranking.
3. Practical Implications: - Helps managers quickly compare sales performance. - Highlights potential high and low performers. - Provides concrete data for evaluation.
4. Limitations: - Results are simulated and random. - Real-world data may differ. - The summary table also holds the average ticket size and the share of revenue lost to discounts, but neither is plotted.
5. Improvements Made: - Proper data types and vectorized operations for efficiency. - Enhanced visualization with dollar formatting. - Added summary statistics and improved readability. - Clear labels and themes for better presentation.
Conclusion:
The chart effectively communicates sales performance differences, making it a useful tool for evaluating team performance over the simulated period.
4 Task 3 - Multi-Level Performance Categorization
In this task, we will create a function called categorize_performance to categorize sales data. Each sale will be classified based on its amount. After categorization, we will calculate the percentage for each category. Finally, we will visualize the data using bar charts and pie charts to make it easier to understand.
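1. Function: categorize_performance(sales_amount)
This function acts as a “filter”. It takes a numerical sales amount and returns a text label based on conditional logic (IF-ELSE). The classification rules are: - Excellent: Sales > 9000 - Very Good: Sales between 7001 – 9000 - Good: Sales between 5001 – 7000 - Average: Sales between 3001 – 5000 - Poor: Sales ≤ 3000. The listing in this section applies the same five-tier logic with cut(), but with breaks at 100, 200, 300, and 400 to suit its small sample vector, and with the lowest tier labeled "Below Standard".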
2. Percentage Calculation
After categorizing all the data, we do not just count the entries in each category; we also calculate each category's percentage of the total. The formula is: \[ \text{Percentage} = \frac{\text{Count per Category}}{\text{Total Data}} \times 100 \]
3. Looping & Distribution
The program will loop through the entire sales data vector, assign a label to each entry, and then summarize the results into a distribution table.
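The three steps above can be sketched in base R as follows. This is a minimal illustration that uses the thresholds from the classification rules; the sample vector `sales` is made up for the example, and treating each boundary as "greater than the lower threshold" is our assumption for non-integer amounts.

```r
# Step 1: IF-ELSE classifier following the rules listed above
categorize_performance <- function(sales_amount) {
  if (sales_amount > 9000) {
    "Excellent"
  } else if (sales_amount > 7000) {
    "Very Good"
  } else if (sales_amount > 5000) {
    "Good"
  } else if (sales_amount > 3000) {
    "Average"
  } else {
    "Poor"
  }
}

# Illustrative data (assumption, not taken from the report)
sales <- c(2500, 3200, 4800, 5100, 6900, 7500, 8800, 9500, 12000, 3000)

# Step 3: loop through the vector and label each entry
labels <- character(length(sales))
for (i in seq_along(sales)) {
  labels[i] <- categorize_performance(sales[i])
}

# Step 2: percentage per category = count / total * 100
tier_levels <- c("Poor", "Average", "Good", "Very Good", "Excellent")
counts <- table(factor(labels, levels = tier_levels))
percentages <- round(100 * counts / sum(counts), 1)

print(percentages)
```

Converting the labels to a factor with fixed levels keeps empty categories in the table with a count of zero, so the distribution always shows all five tiers.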
# ==========================================================================
# STAFF PERFORMANCE ANALYTICS: DISTRIBUTION & COMPOSITION
# Focus: Tier Categorization and Visual Share Analysis
# ==========================================================================
library(ggplot2)
library(dplyr)
library(patchwork)
library(scales)
visualize_performance <- function(sales_data) {
# 1. DATA PREPARATION & CATEGORIZATION ===================================
# Transforming raw numbers into meaningful performance tiers
df <- data.frame(Value = sales_data) %>%
mutate(
Category = cut(Value,
breaks = c(-Inf, 100, 200, 300, 400, Inf),
labels = c("Below Standard", "Average", "Good", "Very Good", "Excellent"),
right = TRUE),
Category = factor(Category, levels = c("Below Standard", "Average", "Good", "Very Good", "Excellent"))
)
# 2. STATISTICAL SUMMARY =================================================
# Calculating the weight of each tier
percentages <- df %>%
count(Category) %>%
mutate(
Percentage = n / sum(n),
Label = percent(Percentage, accuracy = 0.1)
)
# 3. THEME SETUP (DARK MODE) ======================================
modern_theme <- theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid = element_blank(),
plot.title = element_text(face = "bold", size = 14, color = "#FFFFFF", hjust = 0.5),
plot.subtitle = element_text(size = 10, color = "#38BDF8", hjust = 0.5),
legend.position = "none"
)
# 4. VIZ 1: THE SPREAD (Bar Chart) =======================================
p1 <- ggplot(percentages, aes(x = Category, y = Percentage, fill = Category)) +
geom_col(width = 0.7, alpha = 0.85) +
geom_text(aes(label = Label), vjust = -0.5, color = "#FFFFFF", fontface = "bold") +
scale_y_continuous(labels = percent) +
scale_fill_brewer(palette = "RdYlGn") +
labs(title = "Performance Distribution", subtitle = "Percentage spread across categories", x = NULL, y = "Share (%)") +
modern_theme
# 5. VIZ 2: THE SHARE (Modern Donut Chart) ===============================
p2 <- ggplot(percentages, aes(x = 2, y = Percentage, fill = Category)) +
geom_bar(stat = "identity", width = 1, color = "#151921") +
coord_polar("y", start = 0) +
# Creating the center hole for the Donut aesthetic
xlim(0.5, 2.5) +
geom_text(aes(label = ifelse(Percentage > 0.05, Label, "")),
position = position_stack(vjust = 0.5), color = "black", fontface = "bold") +
scale_fill_brewer(palette = "RdYlGn") +
labs(title = "Staff Composition", subtitle = "Tier Breakdown", fill = "Category") +
theme_void() +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(face = "bold", size = 14, color = "#FFFFFF", hjust = 0.5),
plot.subtitle = element_text(size = 10, color = "#38BDF8", hjust = 0.5),
legend.position = "right",
legend.text = element_text(color = "#E0E0E0")
)
# 6. FINAL ASSEMBLY ======================================================
combined_report <- (p1 + p2) +
plot_annotation(
title = "ANNUAL SALES PERFORMANCE REPORT 2026",
theme = theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(color = "#FFFFFF", size = 20, face = "bold", hjust = 0.5, margin = margin(t=10))
)
)
print(combined_report)
return(list(data = df, summary = percentages))
}
# 7. EXECUTION =============================================================
sales_stats <- c(50, 150, 250, 350, 450, 220, 130, 380, 90, 240, 420, 180, 290, 310)
results <- visualize_performance(sales_stats)
Interpretation:
-Bar Plot: - Shows the percentage of sales values in each performance category - Color gradient (red-yellow-green) helps quick assessment - Labels show exact percentages for clarity
-Donut Chart: - Displays the relative proportion of each category - Useful for seeing overall composition - Uses the same colors as the bar plot
-Key Takeaways: - The bar plot is better for comparing categories - The donut chart shows the overall composition - Color coding highlights problem areas (red) and successes (green) - Together they give a complete picture of performance
5 Task 4 - Multi-Company Dataset Simulation
In this project, we will create a simulated dataset that includes several companies, each having a number of employees. Each employee will have attributes including company ID, employee ID, salary, department, performance score, and KPI (Key Performance Indicator) score. We will also calculate some summary statistics for each company, including average salary, average performance score, and maximum KPI score. This data will then be visualized in the form of tables and graphs.
This project will help us understand how to create and analyze simulation data using Python and R programming. Additionally, we will learn how to use nested loops to build the dataset and apply conditional logic to identify high-performing employees.
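The nested-loop construction and the conditional flag described above can be sketched in base R as shown below. The distributions mirror those in the listing, but the small sizes (3 companies, 4 employees), the `top_performer` column, and the KPI cutoff of 90 are our own assumptions for illustration.

```r
set.seed(2026)

n_company   <- 3  # assumption: small sizes for a readable example
n_employees <- 4
departments <- c("HR", "IT", "Sales", "Marketing")

rows <- vector("list", n_company * n_employees)
k <- 1
for (c_id in seq_len(n_company)) {        # outer loop: companies
  for (e_id in seq_len(n_employees)) {    # inner loop: employees
    # Performance score clamped to the 40-100 range
    perf <- round(min(max(rnorm(1, mean = 75, sd = 12), 40), 100), 1)
    kpi  <- min(100, perf + rnorm(1, mean = 2, sd = 4))

    rows[[k]] <- data.frame(
      company_id        = paste("Comp", c_id),
      employee_id       = e_id,
      department        = sample(departments, 1),
      performance_score = perf,
      KPI_score         = kpi,
      salary            = max(round(32000 + perf * 850 + rnorm(1, 0, 4000)), 30000),
      # Hypothetical rule: flag employees whose KPI score exceeds 90
      top_performer     = if (kpi > 90) "Yes" else "No"
    )
    k <- k + 1
  }
}
employees <- do.call(rbind, rows)

# Summary statistics per company
avg_salary <- tapply(employees$salary, employees$company_id, mean)
company_summary <- data.frame(
  company_id      = names(avg_salary),
  avg_salary      = as.numeric(avg_salary),
  avg_performance = as.numeric(tapply(employees$performance_score, employees$company_id, mean)),
  max_KPI         = as.numeric(tapply(employees$KPI_score, employees$company_id, max))
)

print(company_summary)
```

Collecting rows in a pre-allocated list and binding once at the end avoids repeatedly copying the data frame inside the loop.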
# ==========================================================================
# CORPORATE ANALYTICS: COMPENSATION VS. PERFORMANCE BENCHMARKING
# Focus: Salary Equity and Talent Distribution
# ==========================================================================
library(ggplot2)
library(dplyr)
library(patchwork)
library(scales)
# 1. DATA GENERATION ENGINE ================================================
generate_company_data <- function(n_company, n_employees) {
set.seed(2026)
# Create a clean, vectorized employee database
raw_data <- expand.grid(
company_id = paste("Comp", 1:n_company),
employee_id = 1:n_employees
) %>%
mutate(
department = sample(c('HR', 'IT', 'Sales', 'Marketing'), n(), replace = TRUE),
# Performance is randomized with a normal distribution for realism
performance_score = round(rnorm(n(), mean = 75, sd = 12) %>% pmin(100) %>% pmax(40), 1),
# Salary is modeled as a function of performance + market noise
salary = round(32000 + (performance_score * 850) + rnorm(n(), 0, 4000), 0) %>% pmax(30000),
KPI_score = pmin(100, performance_score + rnorm(n(), 2, 4))
)
# Aggregating company-wide metrics
summary_data <- raw_data %>%
group_by(company_id) %>%
summarise(
avg_salary = mean(salary),
avg_performance = mean(performance_score),
max_KPI = max(KPI_score),
headcount = n(),
.groups = 'drop'
)
return(list(data = raw_data, summary = summary_data))
}
# 2. VISUALIZATION ENGINE (DASHBOARD) ======================================
create_corporate_dashboard <- function(result_list) {
# Dark UI Theme
modern_theme <- theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid.major = element_line(color = "#2D333F", linewidth = 0.1),
panel.grid.minor = element_blank(),
plot.title = element_text(face = "bold", size = 16, color = "#FFFFFF"),
plot.subtitle = element_text(size = 10, color = "#38BDF8"),
axis.title = element_text(face = "bold", color = "#FFFFFF"),
axis.text = element_text(color = "#A0A0A0"),
legend.position = "none"
)
# Plot A: Average Salary by Company (Bar Chart)
p1 <- ggplot(result_list$summary, aes(x = company_id, y = avg_salary, fill = company_id)) +
geom_col(width = 0.6, alpha = 0.8) +
geom_text(aes(label = dollar(avg_salary, scale = 1/1000, suffix = "K")),
vjust = 1.5, color = "black", fontface = "bold", size = 3.5) +
scale_y_continuous(labels = label_dollar()) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Average Compensation", subtitle = "Annual salary benchmark per entity", x = NULL, y = "Salary ($)") +
modern_theme
# Plot B: Performance Distribution (Violin + Boxplot)
p2 <- ggplot(result_list$data, aes(x = company_id, y = performance_score, fill = company_id)) +
geom_violin(alpha = 0.3, color = NA) +
geom_boxplot(width = 0.15, color = "white", alpha = 0.6, outlier.shape = NA) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Performance Distribution", subtitle = "Score variance and talent density", x = NULL, y = "Score (0-100)") +
modern_theme
# Final Assembly using patchwork
dashboard <- (p1 / p2) +
plot_annotation(
title = "STRATEGIC TALENT & PAYROLL AUDIT 2026",
theme = theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(color = "#FFFFFF", size = 20, face = "bold", hjust = 0.5)
)
)
return(dashboard)
}
# 3. EXECUTION =============================================================
audit_results <- generate_company_data(n_company = 5, n_employees = 40)
final_dashboard <- create_corporate_dashboard(audit_results)
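print(final_dashboard)
Interpretation:
1. What the charts show: Top panel: bars of average salary per company, labeled in thousands of dollars. Bottom panel: violin and box plots of employee performance scores for each company.
2. Key findings: Companies are easy to compare side by side. A higher average salary does not by itself mean better performance: all companies are simulated from the same distributions, so the gaps between them are sampling noise. Performance varies widely within each company even when average salaries are similar.
3. Practical use: With real data, the same layout can spot companies with high pay but low performance (inefficient). It can also identify companies with good performance despite lower pay (effective management). Useful reference for improvement.
4. Limitations: Performance scores are clamped to the 40-100 range, which distorts the tails slightly. Simulated data (real-world results may differ).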
This visualization is useful for HR and management teams to analyze the relationship between salary and performance across companies.
6 Task 5 - Monte Carlo Simulation: Pi & Probability
Monte Carlo Simulation: Estimating π and Probability
Monte Carlo simulation is a statistical technique used to estimate numerical values through random sampling. Here we use it to estimate the value of π (pi) and to analyze the probability of random points falling within a defined sub-square. The simulation compares the number of points that land inside a circle with the total number of points scattered over the surrounding square.
1. Basic Concept of Estimating π
Imagine a circle with a radius \(r = 1\) placed inside a square with side length \(2r = 2\). Area of the Circle: \(\pi r^2 = \pi(1)^2 = \pi\). Area of the Square: \((2r)^2 = 2^2 = 4\). Ratio: \(\frac{\text{Area of the Circle}}{\text{Area of the Square}} = \frac{\pi}{4}\). Since uniformly scattered points fall inside the circle in proportion to its area, we can derive: \[ \pi \approx 4 \times (\text{Ratio of points inside the circle}) \]
2. Function: monte_carlo_pi(n_points)
This function will: Generate \(n\) random coordinate points \((x, y)\) where \(x\) and \(y\) are between -1 and 1. Calculate the distance of each point from the center \((0, 0)\) using the Pythagorean theorem: \[ d = \sqrt{x^2 + y^2} \] If \(d \le 1\), the point lies inside the circle.
3. Probability Analysis of Sub-Squares
Besides estimating π, we can also calculate the probability of points falling in specific areas (e.g., a small square in the top-right corner). This helps us practice data filtering logic in simulations.
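A compact base R sketch of the function described above is given below, extended with the sub-square probability. The sub-square bounds (the top-right corner from 0.5 to 1 on both axes), the `seed` argument, and the names of the returned list elements are assumptions chosen for illustration.

```r
monte_carlo_pi <- function(n_points, seed = 2026) {
  set.seed(seed)

  # Random points in the square [-1, 1] x [-1, 1]
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)

  # A point is inside the unit circle when its distance from (0, 0) is <= 1
  inside_circle <- sqrt(x^2 + y^2) <= 1

  # Hypothetical sub-square: top-right corner [0.5, 1] x [0.5, 1]
  in_sub_square <- x >= 0.5 & y >= 0.5

  list(
    pi_estimate       = 4 * mean(inside_circle),
    sub_square_prob   = mean(in_sub_square),
    # Theoretical value: sub-square area / big-square area
    sub_square_theory = (0.5 * 0.5) / 4
  )
}

res <- monte_carlo_pi(100000)
print(res)
```

The sub-square covers an area of 0.25 out of 4, so the simulated probability should settle near 0.0625 as the number of points grows.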
# ==========================================================================
# STOCHASTIC MODELING: MONTE CARLO PI ESTIMATION
# Concept: Geometric Probability & Convergence Analysis
# ==========================================================================
library(ggplot2)
library(dplyr)
library(ggforce) # For precision circle geometry
library(patchwork) # To combine multiple plots
simulate_pi_engine <- function(n_points) {
set.seed(2026) # Ensuring reproducibility
# 1. DATA GENERATION & LOGIC =============================================
# Using a vectorized approach for high-speed simulation
df <- data.frame(
x = runif(n_points, -1, 1),
y = runif(n_points, -1, 1)
) %>%
mutate(
# The Distance Formula: sqrt(x^2 + y^2)
dist_from_center = sqrt(x^2 + y^2),
is_inside = dist_from_center <= 1,
# Calculating the running estimate to track convergence
cumulative_inside = cumsum(is_inside),
running_n = row_number(),
pi_estimate = (cumulative_inside / running_n) * 4
)
# Final Statistics
final_est <- tail(df$pi_estimate, 1)
error_pct <- abs(final_est - pi) / pi * 100
# 2. DARK UI THEME ===============================================
modern_theme <- theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid = element_blank(),
plot.title = element_text(face = "bold", size = 16, color = "#FFFFFF"),
plot.subtitle = element_text(size = 10, color = "#38BDF8"),
axis.text = element_text(color = "#A0A0A0"),
axis.title = element_text(color = "#FFFFFF", face = "bold"),
legend.position = "bottom"
)
# 3. VIZ 1: THE SIMULATION BOX ===========================================
p1 <- ggplot(df, aes(x = x, y = y, color = is_inside)) +
geom_circle(aes(x0 = 0, y0 = 0, r = 1), inherit.aes = FALSE,
color = "#FFFFFF", linewidth = 0.8, linetype = "dashed") +
geom_point(alpha = 0.5, size = 0.8) +
scale_color_manual(values = c("#FF5555", "#50FA7B"),
labels = c("Outside (Miss)", "Inside (Hit)")) +
coord_fixed() +
labs(title = "Geometric Probability", subtitle = "Ratio of Hits vs. Total Samples", color = "Status:") +
modern_theme + theme(axis.text = element_blank(), axis.title = element_blank())
# 4. VIZ 2: CONVERGENCE TRACKER ==========================================
p2 <- ggplot(df, aes(x = running_n, y = pi_estimate)) +
geom_line(color = "#38BDF8", linewidth = 1) +
geom_hline(yintercept = pi, color = "#FF79C6", linetype = "dotted", linewidth = 1) +
annotate("text", x = n_points*0.8, y = pi + 0.1, label = "True Pi", color = "#FF79C6") +
labs(title = "Convergence Analysis", subtitle = "Estimate stability over time", x = "Iterations", y = "Est. Value") +
modern_theme
# 5. ASSEMBLY ============================================================
combined_viz <- (p1 | p2) +
plot_annotation(
title = paste("MONTE CARLO PI ESTIMATION:", round(final_est, 5)),
subtitle = paste("Sample Size:", format(n_points, big.mark=","), "| Error:", round(error_pct, 4), "%"),
theme = theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(color = "#FFFFFF", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "#38BDF8", size = 12, hjust = 0.5)
)
)
print(combined_viz)
return(list(estimate = final_est, error = error_pct))
}
# Execute with 5k points for a clean, sharp look
simulation_results <- simulate_pi_engine(5000)
Interpretation:
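1. What we see: Green dots: points inside the unit circle (x² + y² ≤ 1). Red dots: points outside. White dashed circle: the unit circle boundary. The right-hand panel tracks the running estimate of π against the true value.
2. How it works: Random points are scattered over a 2×2 square. The share of points inside the circle, multiplied by 4, estimates π. More points give better accuracy; the typical error shrinks roughly in proportion to 1/√n.
3. Key takeaways: A visual demonstration of the Monte Carlo method. Shows the connection between probability and geometry. A simple but powerful concept.
4. Limitations: Needs many points for good accuracy. This basic version is not suited to high-precision calculation.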
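To make the accuracy point concrete, the sketch below repeats the estimate for several sample sizes and reports the binomial standard error of each. The sample sizes and the column names are our own choices for illustration.

```r
set.seed(2026)

sample_sizes <- c(100, 1000, 10000, 100000)  # assumption: illustrative sizes
convergence <- data.frame(
  n         = sample_sizes,
  estimate  = NA_real_,
  std_error = NA_real_
)

for (i in seq_along(sample_sizes)) {
  n <- sample_sizes[i]
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)

  p_hat <- mean(x^2 + y^2 <= 1)          # share of points inside the circle
  convergence$estimate[i] <- 4 * p_hat

  # Binomial standard error of the share, scaled by 4 to match the estimate
  convergence$std_error[i] <- 4 * sqrt(p_hat * (1 - p_hat) / n)
}

convergence$abs_error <- abs(convergence$estimate - pi)
print(convergence)
```

Each tenfold increase in points cuts the standard error by a factor of about 3.2 (the square root of 10), which is why extra digits of π become expensive with this method.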
7 Task 6 - Advanced Data Transformation & Feature Engineering
In this section, we will perform more complex data transformations and create new features that can enhance our data analysis model. We will utilize two main functions, normalize_columns(df) and z_score(df), to normalize columns within the dataset. Additionally, we will create new features such as performance category and salary range. Afterward, we will compare the data distribution before and after transformation and visualize it using histograms and box plots.
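The two functions named above can be sketched in base R as follows. This is one possible implementation under our own assumptions: both functions transform every numeric column and leave other columns untouched, and the salary bands used for the `salary_range` feature (45,000 and 65,000) are hypothetical cutoffs.

```r
# Min-max normalization: rescale every numeric column to [0, 1]
normalize_columns <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(col) {
    rng <- range(col, na.rm = TRUE)
    if (diff(rng) == 0) return(rep(0, length(col)))  # constant column: avoid 0/0
    (col - rng[1]) / diff(rng)
  })
  df
}

# Z-score standardization: mean 0 and standard deviation 1 per numeric column
z_score <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(col) {
    (col - mean(col, na.rm = TRUE)) / sd(col, na.rm = TRUE)
  })
  df
}

set.seed(123)
employees <- data.frame(
  salary      = rnorm(200, mean = 55000, sd = 12000),
  performance = rnorm(200, mean = 70, sd = 15)
)

scaled       <- normalize_columns(employees)
standardized <- z_score(employees)

# New feature: salary range (hypothetical bands)
employees$salary_range <- cut(
  employees$salary,
  breaks = c(-Inf, 45000, 65000, Inf),
  labels = c("Low", "Mid", "High")
)

print(summary(scaled$salary))
print(table(employees$salary_range))
```

Min-max scaling fixes the range, while z-scores fix the mean and spread; both keep the shape of the distribution, which is what the before-and-after histograms are meant to show.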
# 1. LOAD EXTENDED LIBRARIES ===============================================
library(ggplot2)
library(dplyr)
library(patchwork)
library(scales)
library(ggtext)
# 2. DATA PREPARATION ======================================================
set.seed(123)
employee_data <- data.frame(
salary = rnorm(200, mean = 55000, sd = 12000),
performance = rnorm(200, mean = 70, sd = 15)
) %>%
mutate(
# Min-Max Scaling Logic
salary_norm = (salary - min(salary)) / (max(salary) - min(salary)),
# Categorical Performance Binning
performance_cat = cut(performance,
breaks = c(-Inf, 50, 70, 85, Inf),
labels = c("Low", "Average", "Good", "Excellent"))
)
# 3. GLOBAL DESIGN TOKENS (Cyberpunk Dark Mode) ============================
bg_dark <- "#0f172a"
panel_dark <- "#1e293b"
text_light <- "#f8fafc"
neon_blue <- "#38bdf8"
neon_pink <- "#f472b6"
neon_green <- "#4ade80"
neon_yellow <- "#fbbf24"
# Standardized Theme for all plots
cyber_theme <- theme_minimal(base_family = "sans") +
theme(
plot.background = element_rect(fill = bg_dark, color = NA),
panel.background = element_rect(fill = panel_dark, color = NA),
text = element_text(color = text_light),
plot.title = element_markdown(face = "bold", size = 14),
plot.subtitle = element_markdown(size = 9, color = "grey60", margin = margin(b=12)),
panel.grid.major = element_line(color = "#334155", linewidth = 0.4),
panel.grid.minor = element_blank(),
axis.text = element_text(color = "grey50", size = 8),
axis.title = element_text(face = "bold", size = 9),
legend.position = "none"
)
# 4. COMPONENT FUNCTIONS & VISUALS =========================================
# A. Function to create distribution plots safely
create_dist_plot <- function(data, x_col, color, title, label_func = identity) {
ggplot(data, aes(x = .data[[x_col]])) +
geom_histogram(aes(y = after_stat(density)), fill = color, alpha = 0.2, bins = 25) +
geom_density(color = color, linewidth = 1) +
scale_x_continuous(labels = label_func) +
labs(title = title) +
cyber_theme +
theme(axis.title = element_blank(),
panel.background = element_rect(fill = "#111827", color = NA))
}
# B. Generate Individual Plots
# Scatter
p_main <- ggplot(employee_data, aes(x = salary, y = performance)) +
geom_point(aes(fill = performance_cat), size = 3.5, shape = 21, color = bg_dark, alpha = 0.8) +
geom_smooth(method = "lm", color = neon_blue, linetype = "solid", se = TRUE, fill = neon_blue, alpha = 0.1) +
scale_fill_manual(values = c("#ef4444", neon_yellow, neon_green, neon_blue)) +
scale_x_continuous(labels = label_dollar()) +
labs(
title = "Performance vs. Salary <span style='color:#38bdf8;'>Correlation</span>",
subtitle = "Mapping <span style='color:#4ade80;'>High Performers</span> against compensation scales",
x = "Annual Gross Salary", y = "KPI Score (0-100)"
) +
cyber_theme
# Generate sidebar plots using the function
p_dist1 <- create_dist_plot(employee_data, "salary", neon_blue,
"Raw Salary Distribution", label_dollar())
p_dist2 <- create_dist_plot(employee_data, "salary_norm", neon_pink,
"Normalized Salary [0, 1]")
# 5. FINAL ASSEMBLY ========================================================
# Combine sidebar vertically
sidebar <- p_dist1 / p_dist2
# Combine Main plot and Sidebar horizontally
final_dashboard <- (p_main + sidebar) +
plot_layout(widths = c(2.2, 1)) +
plot_annotation(
title = "HR ANALYTICS: DATA TRANSFORMATION ENGINE",
subtitle = "Technical Dashboard: Impact of Min-Max Normalization on Talent Metrics",
caption = "Confidential HR Report | 2026 Simulation | R-Patchwork System",
theme = theme(
plot.background = element_rect(fill = bg_dark, color = NA),
plot.title = element_text(color = "white", size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = neon_blue, size = 10, hjust = 0.5),
plot.caption = element_text(color = "grey40", size = 7, hjust = 0.95)
)
)
# Render output
print(final_dashboard)
Interpretation:
1. What the Graphs Show:
Top sidebar panel (Blue): Real salary data ($30k-$70k). Most people earn around $50k.
Bottom sidebar panel (Pink): Scaled data (0-1). Same shape as the original but easier to compare.
2. Why It’s Useful: Normalization makes variables measured in different units (e.g., salary vs. performance) comparable by putting them on a common 0-1 scale while keeping the original pattern.
3. Key Points: The shape stays the same after scaling. The most common salaries still show up as the tallest bars. Only the scale (x-axis) changes, not the data’s story.
4. Practical Use: It helps in machine learning and analysis, where metrics on very different scales (like salary and performance) need to be compared or combined.
5. Limitations: Normalization does not change the shape of the distribution, so any skewness remains. The original dollar amounts are no longer visible in the scaled column unless the minimum and maximum are kept. Extreme values still matter: the minimum and maximum define the scale, so a single outlier can compress all other values into a narrow band.
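The points above can be checked with a minimal base R sketch. The helper name `min_max` and the simulated salary vector are assumptions made for illustration; they are not the code used earlier in this report. The sketch shows that scaling changes only the axis and that the original values can be recovered if the minimum and maximum are stored.

```r
# Hypothetical helper: rescale a numeric vector to the [0, 1] range.
# Assumes x has at least two distinct values (otherwise it divides by zero).
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(42)
# Illustrative salaries only; mean and sd are assumptions.
salary <- round(rnorm(200, mean = 50000, sd = 8000))
salary_norm <- min_max(salary)

# The scale changes to [0, 1] ...
range(salary_norm)

# ... but ordering and relative spacing do not: raw and scaled values
# are perfectly linearly related.
cor(salary, salary_norm)

# The scaling can be inverted only because min and max are still available.
restored <- salary_norm * (max(salary) - min(salary)) + min(salary)
all.equal(restored, salary)
```

Because the transformation is linear, a histogram of `salary_norm` has the same shape as a histogram of `salary` drawn with the same number of bins.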
8 Task 7 - Mini Project: Company KPI Dashboard & Simulation
Executive Workforce Analytics 2026
In this project, we simulated a dataset covering 5 to 8 different companies (the exact number is drawn at random) to analyze the relationship between employee performance (KPI) and compensation (Salary). The main objective of this dashboard is to bridge the gap between raw data and strategic insights, providing HR managers or CEOs with a data-driven foundation for high-level decision-making.
The dashboard focuses on three key pillars:
1. ROI Analysis: Investigating whether “money” actually correlates with output, that is, checking whether higher pay consistently goes together with better performance.
2. Market Benchmarking: Comparing salary standards across multiple entities to ensure the organization remains competitive in the talent market.
3. Operational Deep-Dive: Breaking down performance by department to identify “powerhouse” teams versus those that need extra support, such as targeted training or performance evaluations.
# ==========================================================================
# FINAL COMPREHENSIVE KPI DASHBOARD (UNIFIED & SCALED)
# ==========================================================================
# 1. LIBRARIES & DATA GENERATION ===========================================
library(ggplot2); library(dplyr); library(patchwork); library(scales); library(ggtext)
set.seed(2026)
n_companies <- sample(5:8, 1)
companies <- paste("Company", LETTERS[1:n_companies])
depts <- c("Engineering", "Sales", "Marketing", "HR", "Finance")
raw_data <- data.frame()
for(comp in companies) {
n_emp <- sample(80:120, 1)
temp_df <- data.frame(
company_id = comp,
department = sample(depts, n_emp, replace = TRUE),
salary = round(rnorm(n_emp, 65000, 12000), 0) %>% pmax(30000),
performance_score = round(runif(n_emp, 45, 100), 1)
)
raw_data <- rbind(raw_data, temp_df)
}
raw_data <- raw_data %>%
mutate(KPI_score = pmin(100, performance_score * 0.82 + rnorm(n(), 12, 4))) %>%
mutate(KPI_tier = cut(KPI_score, breaks=c(0,60,75,90,100),
labels=c("Below Exp.", "Average", "High", "Elite")))
# 2. OPTIMIZED VISUAL THEME ================================================
# Using a slightly smaller base size so things fit perfectly in one view
unified_theme <- theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid.major = element_line(color = "#2D333F", linewidth = 0.15),
panel.grid.minor = element_blank(),
axis.title = element_text(face = "bold", size = 10, color = "#FFFFFF"),
axis.text = element_text(color = "#A0A0A0", size = 9),
plot.title = element_markdown(face = "bold", size = 14, color = "#FFFFFF"),
plot.subtitle = element_text(size = 10, color = "#38BDF8"),
legend.position = "top",
legend.text = element_text(size = 8),
plot.margin = margin(10, 10, 10, 10)
)
# A. REGRESSION PLOT (Top Left)
p1 <- ggplot(raw_data, aes(x = salary, y = KPI_score, color = company_id)) +
geom_point(alpha = 0.4, size = 1.5) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
scale_x_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
scale_color_brewer(palette = "Set3") +
labs(title = "ROI: **Salary vs KPI**", x = "Salary", y = "Score") +
unified_theme
# B. DENSITY PLOT (Top Right)
p2 <- ggplot(raw_data, aes(x = salary, fill = company_id)) +
geom_density(alpha = 0.5, color = "#FFFFFF", linewidth = 0.3) +
scale_x_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Market **Salary Spread**", x = "Salary Range", y = "Density") +
unified_theme + theme(legend.position = "none")
# C. WIDE BAR CHART (Bottom Row - Full Width)
p3 <- raw_data %>%
group_by(department, KPI_tier) %>%
tally() %>%
ggplot(aes(x = department, y = n, fill = KPI_tier)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
geom_text(aes(label = n), position = position_dodge(width = 0.8),
vjust = -0.5, size = 3, color = "white", fontface = "bold") +
scale_fill_manual(values = c("#FF5555", "#FFB86C", "#50FA7B", "#8BE9FD")) +
labs(title = "Departmental **Talent Benchmarking**",
subtitle = "Headcount across tiers",
x = "Department", y = "Staff Count", fill = "KPI Tier") +
unified_theme +
theme(axis.text.x = element_text(size = 11, face = "bold"))
# 3. ASSEMBLY (Combining everything into one balanced dashboard) ===========
# Using patchwork to define a 2x2 grid where the bottom plot takes both columns
final_dashboard <- (p1 + p2) / p3 +
plot_layout(heights = c(1, 1.2)) + # Make the bar chart slightly taller
plot_annotation(
title = "EXECUTIVE WORKFORCE ANALYTICS DASHBOARD 2026",
subtitle = "Consolidated Analysis: ROI Correlation, Compensation Spread, and Talent Tiers",
caption = "Data Simulation | Analytics Portfolio 2026",
theme = theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(color = "#FFFFFF", size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "#38BDF8", size = 12, hjust = 0.5, margin = margin(b = 15)),
plot.caption = element_text(color = "#475569", size = 9, hjust = 1)
)
)
# Display the result
print(final_dashboard)
Interpretation:
1. ROI: Salary vs KPI Score
This scatter plot tracks the “Pay-for-Performance” relationship. In this simulation, salary and KPI_score are generated independently of each other, so the per-company trend lines should be close to flat, and any upward or downward slope reflects sampling noise rather than a real effect. With real data, the same chart would show who is getting the best “bang for their buck”: a company whose line sits high at mid-range salaries reaches “Elite” performance without premium pay, while a line that stays low at high salaries points to overpaying for average results.
2. Market Salary Spread
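This density plot acts as the “Payroll DNA” of the simulation. Most companies peak around the $65K mark and follow a roughly bell-shaped curve, because salaries are drawn from a normal distribution with a mean of $65,000 and a floor of $30,000. A wider curve means the company has a broad salary range (high inequality but high growth potential), while a sharp, narrow peak suggests a very rigid and standardized pay scale. Since every company is simulated with the same standard deviation, the differences in width seen here come from random sampling.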
3. Departmental Talent Benchmarking
This is the “Operational Health Check.” It breaks down exactly where the “Rockstars” and the “Underperformers” are located.
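High Blue Bars (Elite): These are your powerhouse departments. Because departments are assigned at random in this simulation, which department leads depends on the random seed rather than on any built-in advantage.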
High Red/Orange Bars (Below Exp./Average): These indicate “Talent Bottlenecks” where teams might need better training or a recruitment refresh to stay competitive.
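The visual readings above can be backed by numbers. The following is a minimal base R sketch, not the dashboard code itself: it builds a small stand-in dataset (three companies, with column names borrowed from the report and values that are assumptions) and computes, per company, the regression slope, the correlation between salary and KPI, and the salary spread. Since salary and KPI are drawn independently in this stand-in, the slopes and correlations are expected to be close to zero.

```r
set.seed(2026)
# Small stand-in dataset; column names follow the report, values are assumptions.
sim <- data.frame(
  company_id = rep(paste("Company", LETTERS[1:3]), each = 100),
  salary     = pmax(30000, round(rnorm(300, 65000, 12000))),
  KPI_score  = pmin(100, runif(300, 45, 100) * 0.82 + rnorm(300, 12, 4))
)

# One row per company: slope (KPI points per $1,000 of salary),
# Pearson correlation, and salary standard deviation.
company_check <- do.call(rbind, lapply(split(sim, sim$company_id), function(d) {
  fit <- lm(KPI_score ~ I(salary / 1000), data = d)
  data.frame(
    company_id   = d$company_id[1],
    slope_per_1k = unname(coef(fit)[2]),
    correlation  = cor(d$salary, d$KPI_score),
    salary_sd    = sd(d$salary)
  )
}))

print(company_check, row.names = FALSE)
```

A table like this makes it easier to tell a genuine pay-for-performance pattern from a line that only looks tilted because of a handful of points.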
9 Task 8 - Automated Report Generation (Bonus)
Auto-Pilot Reporting Engine
For this part of the project, we’re leveling up from manual work to building a full-blown Automated Reporting Engine. Instead of wasting hours copy-pasting charts for every single company, we’ve coded a ‘Master Function’ that does all the heavy lifting.
Basically, the script loops through the entire dataset, filters the data for each company, calculates the KPIs, and builds a custom high-resolution dashboard on the fly. The best part? It auto-saves one PNG dashboard and one CSV summary per company, ready to use. It’s peak efficiency for any Data Science portfolio.
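Before the full engine, the core pattern (loop over entities, filter, summarise, save) can be sketched in a few lines of base R. Everything in this sketch is an assumption for illustration: the toy data, the helper name `summarise_company`, and the choice to write into a temporary folder instead of the working directory.

```r
set.seed(1)
# Tiny illustrative dataset; names and values are assumptions.
toy <- data.frame(
  company = rep(c("Alpha", "Beta"), each = 5),
  salary  = round(runif(10, 40000, 90000))
)

out_dir <- tempdir()  # temporary folder, so the project directory stays clean

# Hypothetical helper: summarise one company and save the result as a CSV.
summarise_company <- function(name, data, dir) {
  subset_data <- data[data$company == name, ]
  summary_row <- data.frame(
    company    = name,
    headcount  = nrow(subset_data),
    avg_salary = mean(subset_data$salary)
  )
  path <- file.path(dir, paste0("Summary_", name, ".csv"))
  write.csv(summary_row, path, row.names = FALSE)
  path  # return the file path so the caller can check what was written
}

# Apply the helper to every company; vapply collects the file paths.
paths <- vapply(unique(toy$company), summarise_company,
                character(1), data = toy, dir = out_dir)
file.exists(paths)
```

Returning the path from the helper is a small design choice that makes the loop easy to verify: the caller can confirm that one file exists per company.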
# ==========================================================================
# SCALABLE CORPORATE REPORTING ENGINE (AUTOMATED)
# Focus: Performance, Compensation, and Operational Audit
# ==========================================================================
# 1. LOAD ESSENTIAL LIBRARIES ==============================================
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(scales)
library(ggtext)
library(readr)
# 2. SYNTHETIC DATA GENERATION =============================================
set.seed(2026)
n_companies <- 6
depts <- c("Engineering", "Sales", "Marketing", "HR", "Finance")
# Using expand.grid for a clean, vectorized data structure
raw_data <- expand.grid(
company_name = paste("Global Corp", LETTERS[1:n_companies]),
employee_id = 1:60
) %>%
mutate(
department = sample(depts, n(), replace = TRUE),
performance = round(runif(n(), 40, 100), 1),
# Realistic salary simulation linked to performance
salary = round(38000 + (performance * 820) + rnorm(n(), 0, 4500), 0),
kpi_score = pmin(100, performance + rnorm(n(), 5, 4))
)
# 3. CORE REPORTING ARCHITECTURE ===========================================
# This "Master Function" handles the heavy lifting for each entity
generate_company_report <- function(target_company, data) {
# Filter data for the specific target entity
comp_data <- data %>% filter(company_name == target_company)
# A. COMPUTE STRATEGIC METRICS
metrics <- comp_data %>%
summarise(
avg_salary = mean(salary),
avg_perf = mean(performance),
staff_count = n(),
top_dept = names(which.max(table(department))),
roi_ratio = sum(kpi_score) / (sum(salary) / 100000)
)
# B. DARK ANALYTICS THEME
modern_theme <- theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
panel.background = element_rect(fill = "#151921", color = NA),
text = element_text(color = "#E0E0E0"),
panel.grid.major = element_line(color = "#2D333F", linewidth = 0.1),
panel.grid.minor = element_blank(),
plot.title = element_markdown(face = "bold", size = 16, color = "#FFFFFF"),
plot.subtitle = element_text(size = 10, color = "#38BDF8"),
axis.text = element_text(color = "#A0A0A0"),
axis.title = element_text(color = "#FFFFFF", face = "bold"),
legend.position = "none"
)
# C. VIZ 1: SALARY STRUCTURE (Boxplot Audit)
p1 <- ggplot(comp_data, aes(x = department, y = salary, fill = department)) +
geom_boxplot(alpha = 0.7, color = "white", outlier.colour = "#FF5555", linewidth = 0.6) +
scale_y_continuous(labels = label_dollar(prefix = "$", suffix = "")) +
scale_fill_brewer(palette = "Set3") +
labs(title = "**Pay Distribution** by Dept", subtitle = "Auditing base salary across internal teams", x = NULL, y = "Annual Salary") +
modern_theme
# D. VIZ 2: PERFORMANCE CORRELATION (Regression Analysis)
p2 <- ggplot(comp_data, aes(x = performance, y = kpi_score, color = performance)) +
geom_point(size = 2.5, alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ x, color = "#38BDF8", se = FALSE, linewidth = 1.2) +
scale_color_viridis_c(option = "plasma") +
labs(title = "**ROI Check:** Perf vs KPI", subtitle = "Testing the reliability of internal score metrics", x = "Performance Rating", y = "Actual KPI Output") +
modern_theme
# E. ASSEMBLY (Dashboard Layout)
dashboard <- (p1 | p2) +
plot_annotation(
title = paste("EXECUTIVE AUDIT:", toupper(target_company)),
subtitle = paste("Avg Salary:", dollar(metrics$avg_salary), "| Headcount:", metrics$staff_count, "| Primary Hub:", metrics$top_dept),
theme = theme(
plot.background = element_rect(fill = "#0B0E14", color = NA),
plot.title = element_text(color = "#FFFFFF", size = 22, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "#38BDF8", size = 12, hjust = 0.5, margin = margin(b = 20))
)
)
# F. AUTOMATED EXPORT SYSTEM
# 1. Export High-Res Viz
clean_name <- gsub(" ", "_", target_company)
ggsave(paste0("Report_", clean_name, ".png"), dashboard, width = 14, height = 7, dpi = 300)
# 2. Export Summary CSV for Stakeholders
write_csv(metrics, paste0("Summary_", clean_name, ".csv"))
message(paste(">>> Successfully compiled report for:", target_company))
return(list(metrics = metrics, plot = dashboard))
}
# 4. EXECUTION LOOP (THE AUTO-PILOT) =======================================
unique_companies <- unique(raw_data$company_name)
# Using lapply to iterate through all companies and store results in a list
all_reports <- lapply(unique_companies, function(comp) {
generate_company_report(comp, raw_data)
})
# Display a sample from the first company
print(all_reports[[1]]$plot)
Interpretation
1. Salary Structure by Dept (Boxplot)
This chart is our ‘Equity Check.’ By using boxplots, we can see the salary range for each department at a glance.
The Boxes: They show where the middle 50% of employees sit. If a box is ‘tall,’ it means the middle half of salaries in that department is widely spread.
The Red Dots (Outliers): These are people getting paid significantly more (or less) than their peers, drawn as points that lie more than 1.5 times the interquartile range beyond the box. It’s essential for HR to spot and review these cases in the payroll.
2. Performance Correlation (Scatter Plot)
This is our ‘Consistency Metric.’ We’re plotting ‘Core Performance’ against ‘KPI Output’ to see if our internal scores actually mean anything.
The Slope: We’re looking for a clear upward line. That tells us the company’s rating system is actually working: better workers are hitting higher targets. In this simulation the KPI score is the performance rating plus random noise, so a slope close to 1 is expected.
The Spread: If the points are scattered everywhere without a clear trend, it’s a red flag. It means the company’s performance metrics might be a bit random and probably need a total rework.
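Both readings can be reproduced numerically. The sketch below uses illustrative data (all numbers are assumptions, including one planted extreme salary) to show the 1.5 times IQR rule that decides which points a boxplot draws as outliers, and a linear fit whose slope and R-squared quantify how consistently the KPI follows the performance rating.

```r
set.seed(7)
# Illustrative data; the numbers are assumptions, not the report's dataset.
performance <- round(runif(60, 40, 100), 1)
kpi_score   <- pmin(100, performance + rnorm(60, 5, 4))
salary      <- c(round(rnorm(59, 70000, 5000)), 120000)  # one planted extreme value

# Boxplot rule: points more than 1.5 * IQR beyond the quartiles count as outliers.
q        <- quantile(salary, c(0.25, 0.75))
fences   <- c(q[1] - 1.5 * IQR(salary), q[2] + 1.5 * IQR(salary))
outliers <- salary[salary < fences[1] | salary > fences[2]]
outliers

# Consistency check: slope and R-squared of KPI regressed on performance.
fit       <- lm(kpi_score ~ performance)
slope     <- unname(coef(fit)["performance"])
r_squared <- summary(fit)$r.squared
c(slope = slope, r_squared = r_squared)
```

A slope near 1 together with a high R-squared supports the ‘rating system is working’ reading, while a low R-squared corresponds to the scattered cloud described above.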