Advanced Practicum: Functions, Loops, and Data Science Simulation
Practicum Week 5
library(htmltools)
HTML('
<div class="profile-card">
<div>
<img src="JANA.JPEG">
</div>
<div class="profile-text">
<h3>Januaria Teresinha</h3>
<div class="profile-description">
Data Science student at Institut Teknologi Sains Bandung
</div>
<div class="profile-info-row">
<p class="profile-info-item">
<b>Mentored by:</b> <span class="mentor-name">Mr. Bakti Siregar M.Sc.,CDS</span>
</p>
</div>
</div>
</div>
')
Januaria Teresinha
Mentored by: Mr. Bakti Siregar M.Sc.,CDS
1 Introduction
This report compiles the work from the Data Science Programming practicum. Its primary focus is the integration of fundamental programming building blocks (modular functions, nested loops, and multi-tier conditional logic) into complex Data Science workflows.
In a professional data environment, a Data Scientist must often build automated systems to process and simulate large-scale data efficiently. This practicum covers eight core implementations:
1. Mathematical Engine: Computing multi-formula growth models.
2. Sales Simulation: Managing transaction data with nested accumulation logic.
3. Performance Classification: Statistical categorization of workforce data.
4. Corporate Synthesis: Generating hierarchical datasets for multiple companies.
5. Monte Carlo Simulation: Estimating the mathematical constant \(\pi\).
6. Data Transformation: Normalization and Z-Score standardization.
7. Executive Dashboard: Advanced visualization and KPI reporting.
8. Automated Reporting: Developing a function that automatically generates an individual report for each company entity as a separate file.
2 Program 1: Multi-Formula Computational Engine
- Logic Explanation
Validation: The function compute_formula validates user input against a library of allowed models.
Iteration: A loop cycles through the requested formula types, and each formula is evaluated for every \(x\) in the range \([1, 20]\) in a single vectorized operation.
Visualization: Uses ggplot2 with a Logarithmic Scale to ensure exponential growth doesn’t overshadow linear trends.
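To see why a logarithmic axis helps, it is enough to compare the four growth models at the right edge of the range. The sketch below is illustrative only; the name `growth_at_20` is an assumption made for this example and is not part of the practicum code.

```r
# Evaluate each growth model at x = 20, the largest value in the range
x <- 20
growth_at_20 <- c(
  linear      = x,
  quadratic   = x^2,
  cubic       = x^3,
  exponential = 2^x
)
print(growth_at_20)

# On a linear axis the exponential curve would be this many times taller
# than the linear one, pressing the other curves against the x-axis
growth_at_20[["exponential"]] / growth_at_20[["linear"]]

# log10 compresses the spread into a range that fits on one plot
round(log10(growth_at_20), 2)
```

At \(x = 20\) the exponential model reaches 1,048,576 while the linear model is still 20, which is why all four curves are only readable together on a log scale.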
library(ggplot2)
library(tidyr)
library(dplyr)
compute_formula <- function(x_range, formula_list) {
# List of valid formulas
valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
# Validate the input
invalid <- setdiff(formula_list, valid_formulas)
if (length(invalid) > 0) {
stop(paste("Invalid Input:", paste(invalid, collapse = ", ")))
}
# Compute y for each requested formula
results <- data.frame(x = x_range)
for (f in formula_list) {
results[[f]] <- switch(f,
linear = x_range,
quadratic = x_range^2,
cubic = x_range^3,
exponential = 2^x_range)
}
return(results)
}
# Generate the data
data_results <- compute_formula(1:20, c("linear", "quadratic", "cubic", "exponential"))
data_long <- pivot_longer(data_results, cols = -x, names_to = "Formula", values_to = "y")
# Build the plot
ggplot(data_long, aes(x = x, y = y, color = Formula)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
scale_y_log10() + # Logarithmic scale for the y-axis
labs(title = "Comparison of Growth Models",
subtitle = "Linear, Quadratic, Cubic, and Exponential Growth",
x = "X Value",
y = "Output (Y) on Log Scale") +
theme_minimal() +
scale_color_brewer(palette = "Set1") + # Use a more distinctive color palette
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.title = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
3 Program 2: Sales Simulation & Nested Accumulation
- Logic Explanation
Business Rules: Discounts are applied based on sales volume (\(>8k=15\%\), \(>5k=10\%\), \(>2k=5\%\)); amounts of 2k or less receive no discount.
Encapsulation: The function simulate_sales wraps both data generation and the business calculations, with the discount tiers expressed as a single case_when rule.
Cumulative Tracking: Records are grouped by salesperson, and cumsum computes each person's running net revenue total.
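The discount tiers and the running total can be checked by hand on a few values. The following is a minimal base-R sketch, not the practicum's implementation: it swaps in `findInterval()` for the tier lookup, and the four sales amounts are hypothetical values chosen so that each tier is hit once.

```r
# Hypothetical sales amounts chosen to hit each discount tier
sales_amount <- c(1500, 2500, 6000, 9000)

# findInterval() counts how many thresholds each amount has passed;
# left.open = TRUE makes the comparison strict (amount > threshold)
tier <- findInterval(sales_amount, c(2000, 5000, 8000), left.open = TRUE)
discount_rate <- c(0, 0.05, 0.10, 0.15)[tier + 1]

net_sales <- sales_amount * (1 - discount_rate)
cumulative_sales <- cumsum(net_sales)  # running total for one salesperson

print(data.frame(sales_amount, discount_rate, net_sales, cumulative_sales))
```

With these inputs the net amounts are 1500, 2375, 5400, and 7650, so the running total ends at 16925.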
library(dplyr)
library(ggplot2)
simulate_sales <- function(n_salesperson, days) {
set.seed(123) # For reproducible results
total_records <- n_salesperson * days
# Build the sales dataset
sales_data <- tibble(
sales_id = 1:total_records,
salesperson = rep(paste("Sales", 1:n_salesperson), each = days),
day = rep(1:days, times = n_salesperson),
sales_amount = round(runif(total_records, 1000, 10000), 2)
)
# Compute the discount rate and net sales
sales_data <- sales_data %>%
mutate(
discount_rate = case_when(
sales_amount > 8000 ~ 0.15,
sales_amount > 5000 ~ 0.10,
sales_amount > 2000 ~ 0.05,
TRUE ~ 0
),
net_sales = sales_amount * (1 - discount_rate)
)
# Compute cumulative sales per salesperson
sales_data <- sales_data %>%
group_by(salesperson) %>%
mutate(cumulative_sales = cumsum(net_sales)) %>%
ungroup()
return(sales_data)
}
# Generate the sales data
sales_final <- simulate_sales(5, 15)
# Build the visualization
ggplot(sales_final, aes(x = day, y = cumulative_sales, color = salesperson)) +
geom_line(linewidth = 1) +
geom_point(size = 3, alpha = 0.7) + # Add points for clarity
labs(
title = "Cumulative Sales Performance Over Time",
subtitle = "Comparison of Salespersons' Performance",
x = "Day",
y = "Cumulative Sales (Net)",
color = "Salesperson"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.title = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
scale_color_brewer(palette = "Set1") # Use a distinctive color palette
4 Program 3: Performance Classification
- Logic Explanation
Classification: Uses a vectorized function to map sales into 5 distinct categories.
Proportional Analysis: Calculates the relative frequency (%) to assess workforce distribution.
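How values that sit exactly on a category boundary are classified depends on the `right` argument of `cut()`. The sketch below uses the same five categories with `right = FALSE`; the sample values are hypothetical and were picked to sit on or next to the boundaries.

```r
# Breaks and labels for the five performance categories
breaks <- c(0, 3000, 5000, 7000, 9000, Inf)
labels <- c("Poor", "Average", "Good", "Very Good", "Excellent")

# Hypothetical values placed on and around the category boundaries
sample_sales <- c(2999, 3000, 6999, 7000, 9000)

# right = FALSE makes each interval closed on the left: [3000, 5000), etc.
categories <- cut(sample_sales, breaks = breaks, labels = labels, right = FALSE)
print(data.frame(sample_sales, categories))

# Relative frequency (%) of each category
percentages <- round(prop.table(table(categories)) * 100, 1)
print(percentages)
```

Because the intervals are left-closed, a sale of exactly 3000 is classified as "Average" rather than "Poor", and exactly 9000 is "Excellent".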
library(ggplot2)
library(dplyr)
# Function to categorize sales performance
categorize_performance <- function(sales_vec) {
cut(sales_vec,
breaks = c(0, 3000, 5000, 7000, 9000, Inf),
labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
right = FALSE)
}
# Generate the sales data
set.seed(456)
data_sales <- runif(100, 1000, 10000)
# Build the category distribution data frame
dist_df <- data.frame(Category = categorize_performance(data_sales)) %>%
count(Category) %>%
mutate(Percentage = (n / sum(n)) * 100)
# Build the visualization
ggplot(dist_df, aes(x = Category, y = n, fill = Category)) +
geom_bar(stat = "identity", color = "black", linewidth = 0.3) + # Add bar outlines
geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + # Add percentage labels
labs(
title = "Distribution of Sales Performance Categories",
subtitle = "Based on 100 Simulated Sales Data Points",
x = "Performance Category",
y = "Count",
fill = "Category"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.title = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
panel.grid.major.x = element_blank(), # Remove vertical major grid lines
panel.grid.minor.y = element_blank() # Remove horizontal minor grid lines
) +
scale_fill_brewer(palette = "Pastel1") + # Use a soft color palette
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) # Add headroom above the bars
5 Program 4: Multi-Company Data Synthesis
- Logic Explanation
Hierarchy: Generates data for multiple companies by crossing company and employee IDs with expand.grid, which replaces explicit nested loops.
KPI Logic: Identifies “Top Performers” with KPI scores above 90.
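The company/employee hierarchy can be built either with explicit nested loops or with one vectorized call. The sketch below compares the two on a deliberately tiny case (2 companies, 3 employees); the names `loop_grid` and `vec_grid` are assumptions made for this illustration.

```r
# Nested-loop version: the outer loop walks companies, the inner loop employees
companies <- paste("Company", LETTERS[1:2])
rows <- list()
for (comp in companies) {
  for (emp in 1:3) {
    rows[[length(rows) + 1]] <- data.frame(company_id = comp, employee_id = emp)
  }
}
loop_grid <- do.call(rbind, rows)

# Vectorized version: expand.grid() builds every combination in one call
vec_grid <- expand.grid(company_id = companies, employee_id = 1:3,
                        stringsAsFactors = FALSE)

# Both hold the same 2 x 3 = 6 company/employee pairs (row order differs)
same_pairs <- setequal(paste(loop_grid$company_id, loop_grid$employee_id),
                       paste(vec_grid$company_id, vec_grid$employee_id))
print(c(loop_rows = nrow(loop_grid), vec_rows = nrow(vec_grid)))
print(same_pairs)
```

The vectorized form avoids growing a list row by row, which matters more as the number of companies and employees increases.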
library(ggplot2)
library(dplyr)
# Function to generate company data
generate_company_data <- function(n_comp, n_emp) {
# Build the data with `expand.grid` to avoid nested loops
all_data <- expand.grid(
company_id = paste("Company", LETTERS[1:n_comp]),
employee_id = 1:n_emp
) %>%
mutate(
salary = round(runif(n_comp * n_emp, 5000, 15000), 0),
perf_score = round(runif(n_comp * n_emp, 1, 10), 1),
kpi_score = round(runif(n_comp * n_emp, 50, 100), 0),
status = ifelse(kpi_score > 90, "Top Performer", "Standard")
)
return(all_data)
}
# Generate data for 4 companies with 50 employees each
set.seed(123) # For reproducible results
my_dataset <- generate_company_data(4, 50)
# Build the visualization
ggplot(my_dataset, aes(x = perf_score, y = salary, color = company_id)) +
geom_point(alpha = 0.7, size = 3) + # Add semi-transparent points
geom_smooth(method = "lm", se = FALSE, linewidth = 1, linetype = "dashed") + # Add regression lines
labs(
title = "Corporate Performance vs. Salary Analysis",
subtitle = "Comparison of 4 Simulated Companies",
x = "Performance Score (1-10)",
y = "Annual Salary ($)",
color = "Company ID"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.title = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_blank()
) +
scale_color_brewer(palette = "Set1") + # Use a distinctive color palette
facet_wrap(~ status, scales = "free") # Split the plot by performance status
company_raw <- generate_company_data(4, 50)
6 Program 5: Monte Carlo Pi Estimation
- Logic Explanation
Stochastic Method: Uses random points \((x,y)\) to calculate the area ratio of a circle inside a square.
Calculation: \(\pi \approx 4 \times (\text{Points Inside} / \text{Total Points})\).
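The formula follows from the areas involved: the unit circle has area \(\pi\) and the enclosing square \([-1, 1]^2\) has area 4, so a uniformly random point lands inside the circle with probability \(\pi/4\). The sketch below shows how the estimate tightens as the number of points grows and compares the observed error with the theoretical standard error. The helper `estimate_pi`, the seed, and the sample sizes are assumptions chosen for illustration.

```r
# Minimal estimator: fraction of uniform points in [-1, 1]^2 that fall
# inside the unit circle, multiplied by 4
estimate_pi <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(1)  # arbitrary seed, only to make this sketch reproducible
sizes <- c(100, 10000, 1000000)
estimates <- vapply(sizes, estimate_pi, numeric(1))

# Theoretical standard error of the estimator: 4 * sqrt(p * (1 - p) / n),
# where p = pi / 4 is the probability of landing inside the circle
p <- pi / 4
std_error <- 4 * sqrt(p * (1 - p) / sizes)

print(data.frame(n_points = sizes, estimate = estimates,
                 abs_error = abs(estimates - pi), std_error = std_error))
```

Since the standard error shrinks in proportion to \(1/\sqrt{n}\), each additional decimal digit of accuracy costs roughly 100 times more points.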
library(ggplot2)
library(dplyr)
# 1. Refined function
simulate_pi_estimation <- function(n_points) {
set.seed(789) # For reproducible results
# Generate random points
points <- data.frame(
x = runif(n_points, -1, 1),
y = runif(n_points, -1, 1)
) %>%
mutate(
distance = x^2 + y^2,
inside = distance <= 1,
point_type = ifelse(inside, "Inside Circle", "Outside Circle")
)
# Calculate pi estimate
pi_estimate <- 4 * sum(points$inside) / n_points
# Create circle boundary data
circle_data <- data.frame(
x = cos(seq(0, 2*pi, length.out = 100)),
y = sin(seq(0, 2*pi, length.out = 100))
)
return(list(
pi_estimate = pi_estimate,
points = points,
circle = circle_data,
n_points = n_points
))
}
# 2. Run the simulation
simulation_results <- simulate_pi_estimation(5000)
# 3. Enhanced visualization
ggplot() +
# Background square
geom_rect(aes(xmin = -1, xmax = 1, ymin = -1, ymax = 1),
fill = "white", color = "black", alpha = 0.1) +
# Points layer
geom_point(
data = simulation_results$points,
aes(x, y, color = point_type),
alpha = 0.6, size = 1.2, shape = 16
) +
# Circle boundary
geom_path(
data = simulation_results$circle,
aes(x, y),
color = "black", linewidth = 1.2, linetype = "solid"
) +
# Coordinate system and styling
coord_fixed() +
scale_color_manual(
values = c("Inside Circle" = "#3498db", "Outside Circle" = "#e74c3c"),
name = "Point Location"
) +
# Labels and annotations
labs(
title = "Monte Carlo Simulation for π Estimation",
subtitle = sprintf(
"Using %d points | Estimated π = %.5f (Error: %.5f)",
simulation_results$n_points,
simulation_results$pi_estimate,
abs(pi - simulation_results$pi_estimate)
),
x = "X Coordinate",
y = "Y Coordinate",
caption = "Theoretical π value = 3.14159"
) +
# Theme customization
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
legend.position = "bottom",
legend.title = element_text(size = 10),
legend.text = element_text(size = 9),
panel.grid = element_line(color = "gray90"),
panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5)
) +
# Additional annotation showing the actual estimate
annotate(
"text", x = 0, y = -1.1,
label = sprintf("Estimate: %.5f", simulation_results$pi_estimate),
size = 5, fontface = "bold", color = "#2c3e50"
)
7 Program 6: Data Normalization & Feature Engineering
- Logic Explanation
Standardization: Implements Min-Max Normalization and Z-Score transformations for salary and KPI variables.
Efficiency Metric: Creates a new feature: \(\text{Performance} / \text{Salary}\).
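The two transformations are easiest to compare on a small vector. This is a minimal base-R sketch with hypothetical values; it is not the practicum's dplyr pipeline.

```r
x <- c(10, 20, 30, 40, 50)  # hypothetical salary values

# Min-max normalization rescales the values to the range [0, 1]
x_norm <- (x - min(x)) / (max(x) - min(x))

# Z-score standardization centers on 0 with unit standard deviation
x_z <- (x - mean(x)) / sd(x)

print(x_norm)
print(round(x_z, 3))
print(c(mean = mean(x_z), sd = sd(x_z)))
```

Here `x_norm` is 0, 0.25, 0.5, 0.75, 1. Note that min-max normalization divides by `max(x) - min(x)`, so a constant column would produce `NaN` and may need a guard in practice.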
# Load necessary libraries
library(dplyr) # For data manipulation
library(ggplot2) # For data visualization
library(scales) # For better axis formatting
library(gridExtra) # For arranging multiple plots
library(grid) # For textGrob and gpar
## 1. DATA TRANSFORMATION FUNCTIONS ========================================
# Min-Max Normalization (0-1 range)
normalize_columns <- function(data, cols) {
data %>%
mutate(across(all_of(cols),
~ (. - min(., na.rm = TRUE)) /
(max(., na.rm = TRUE) - min(., na.rm = TRUE)),
.names = "{col}_norm"))
}
# Z-Score Standardization (mean=0, sd=1)
standardize_columns <- function(data, cols) {
data %>%
mutate(across(all_of(cols),
~ (. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE),
.names = "{col}_z"))
}
## 2. FEATURE ENGINEERING ==================================================
create_features <- function(data) {
data %>%
mutate(
# Performance categories
performance_level = cut(KPI_score,
breaks = c(0, 50, 75, 90, 100),
labels = c("Low", "Medium", "High", "Elite"),
include.lowest = TRUE),
# Salary brackets using quartiles
salary_bracket = case_when(
salary < quantile(salary, 0.25, na.rm = TRUE) ~ "Low",
salary < quantile(salary, 0.75, na.rm = TRUE) ~ "Medium",
TRUE ~ "High"
),
# Efficiency metric (KPI score per $1M salary)
efficiency = ifelse(salary > 0, KPI_score / (salary / 1e6), NA_real_),
# Outlier detection (Z-score > 2.5)
is_outlier = ifelse(abs(scale(KPI_score)[,1]) > 2.5, "Outlier", "Normal")
) %>%
mutate(across(c(performance_level, salary_bracket, is_outlier), as.factor))
}
## 3. DATA PREPARATION =====================================================
set.seed(123)
employee_data <- data.frame(
employee_id = 1:200,
company = sample(c('Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'),
200, replace = TRUE,
prob = c(0.3, 0.25, 0.2, 0.15, 0.1)),
salary = pmax(pmin(rnorm(200, 70000, 15000), 150000), 30000),
performance_raw = rnorm(200, 70, 10),
department = sample(c('Sales', 'Engineering', 'HR', 'Finance', 'Marketing'),
200, replace = TRUE,
prob = c(0.3, 0.25, 0.15, 0.15, 0.15))
) %>%
mutate(
performance = case_when(
department == "Sales" ~ performance_raw * 1.1,
department == "Engineering" ~ performance_raw * 0.9,
TRUE ~ performance_raw
),
KPI_score = pmin(pmax(performance + rnorm(200, 0, 5), 0), 100)
) %>%
select(-performance_raw)
## 4. DATA PROCESSING PIPELINE =============================================
# Fix: keep the norm/z columns for visualization,
# and keep 'efficiency' because it is plotted later.
processed_data <- employee_data %>%
normalize_columns(c("salary", "KPI_score")) %>%
standardize_columns(c("salary", "KPI_score")) %>%
create_features()
## 5. ENHANCED VISUALIZATIONS ==============================================
my_theme <- theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
# 1. Salary Distribution Comparison
salary_plot <- ggplot(processed_data) +
geom_density(aes(x = salary, color = "Original"), linewidth = 1) +
geom_density(aes(x = salary_norm * 100000, color = "Normalized (Scaled)"), linewidth = 1) +
scale_color_manual(name = "Scale",
values = c("Original" = "#E69F00", "Normalized (Scaled)" = "#56B4E9")) +
labs(title = "Salary Distribution", x = "Value", y = "Density") +
my_theme
# 2. Performance by Salary Bracket
performance_plot <- ggplot(processed_data,
aes(x = salary_bracket, y = KPI_score, fill = salary_bracket)) +
geom_boxplot(alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "KPI by Salary Bracket", x = "Salary Bracket", y = "KPI Score") +
my_theme +
theme(legend.position = "none")
# 3. Efficiency Analysis
efficiency_stats <- processed_data %>%
filter(!is.na(efficiency)) %>%
group_by(performance_level) %>%
summarise(
median_eff = median(efficiency, na.rm = TRUE),
mean_eff = mean(efficiency, na.rm = TRUE),
q1 = quantile(efficiency, 0.25, na.rm = TRUE),
q3 = quantile(efficiency, 0.75, na.rm = TRUE)
)
efficiency_plot <- processed_data %>%
filter(!is.na(efficiency)) %>%
ggplot(aes(x = performance_level, y = efficiency, fill = performance_level)) +
geom_violin(alpha = 0.6, width = 0.8) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
geom_point(data = efficiency_stats, aes(y = median_eff), color = "red", shape = 18, size = 2) +
scale_fill_manual(values = c("#FF9AA2", "#FFB7B2", "#FFDAC1", "#E2F0CB")) +
labs(title = "Efficiency Analysis",
subtitle = "KPI Score per $1M Salary",
x = "Performance Level", y = "Efficiency") +
my_theme +
theme(legend.position = "none")
## 6. FINAL DASHBOARD ======================================================
# Layout: Salary and Performance on top, Efficiency (full width) below
grid.arrange(
salary_plot,
performance_plot,
efficiency_plot,
layout_matrix = rbind(c(1, 2), c(3, 3)),
top = textGrob("Employee Performance Analysis Dashboard",
gp = gpar(fontsize = 16, fontface = "bold"))
)
- Conclusion
This practicum demonstrated the versatility of R in managing end-to-end data tasks. By combining stochastic simulations, multi-tiered logic, and advanced visualization, we have built a solid foundation for professional data science modeling and automated reporting systems.
8 Program 7: Integrated Corporate KPI Dashboard
Focus: Workforce Analytics, Stochastic Simulation, and Performance Correlation
Tooling: R Programming (tidyverse, ggplot2)
1. Project Overview
The Company KPI Dashboard serves as a high-level simulation designed to evaluate organizational health across multiple corporate entities. By generating synthetic employee data, this project aims to:
Correlate Incentives: Determine if higher salaries align with superior performance.
Identify Excellence: Segment the workforce to isolate “Elite” performers.
Benchmarking: Compare departmental efficiency (Sales, IT, HR, etc.) to identify operational strengths and weaknesses.
2. Data Requirements & Simulation Logic
To ensure a realistic simulation, each observation in the dataset contains the following attributes:
Identifier: Unique Employee ID and Company ID.
Financials: Annual Salary (Scaled between 30 and 150 million).
Metrics: Performance and KPI scores generated using a normal distribution (\(N(\mu, \sigma^2)\)) to mimic real-world bell-curve distributions.
Categorization: Departmental assignments and Tier classification.
3. Implementation Logic
3.1 Data Synthesis & Stochastic Modeling
We use set.seed(123) for reproducibility. The salary and performance scores are generated using rnorm, then “clamped” using pmin and pmax to stay within logical boundaries (e.g., scores cannot exceed 100).
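The clamping step can be seen in isolation on a few values. The scores below are hypothetical, with two deliberately placed outside the valid range.

```r
raw_scores <- c(-12.5, 48, 100, 131.2)  # hypothetical draws, some out of range

# pmin() caps each element at 100, then pmax() raises anything below 0 to 0
clamped <- pmax(pmin(raw_scores, 100), 0)
print(clamped)
```

The result is 0, 48, 100, 100. One side effect worth keeping in mind is that clamping piles the out-of-range draws onto the boundaries, so the clamped variable is no longer exactly normally distributed.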
3.2 KPI Tier Classification
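The workforce is segmented into four distinct tiers based on their KPI results using the cut() function. Because cut() uses right-closed intervals by default, a score that sits exactly on a boundary falls into the lower tier:
Elite: above 90, up to 100
High: above 75, up to 90
Medium: above 50, up to 75
Low: 0 to 50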
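Boundary scores are worth checking directly, because `cut()` uses right-closed intervals unless told otherwise. The sketch below assumes the breaks 0, 50, 75, 90, 100 with `include.lowest = TRUE`; the test scores are hypothetical.

```r
scores <- c(0, 50, 50.1, 75, 90, 90.1, 100)  # hypothetical boundary cases

# Default right = TRUE gives intervals (50, 75], (75, 90], (90, 100];
# include.lowest = TRUE closes the first interval at 0
tiers <- cut(scores,
             breaks = c(0, 50, 75, 90, 100),
             labels = c("Low", "Medium", "High", "Elite"),
             include.lowest = TRUE)
print(data.frame(scores, tiers))
```

With these settings a score of exactly 90 is labeled "High", and only scores above 90 reach "Elite". Passing `right = FALSE` would shift each boundary value into the upper tier instead.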
# --- 1. Environment Setup ---
library(dplyr)
library(ggplot2)
library(scales) # For number formatting
# --- 2. Synthetic Data Generation ---
set.seed(123)
companies <- c('Company_A', 'Company_B', 'Company_C', 'Company_D', 'Company_E')
departments <- c('Sales', 'IT', 'HR', 'Finance', 'Marketing')
# Generating 200 employee records
df <- data.frame(
employee_id = 1:200,
company_id = sample(companies, 200, replace = TRUE),
salary = pmax(pmin(rnorm(200, 70, 15), 150), 30) * 1e6, # Salary in millions
performance_score = pmax(pmin(rnorm(200, 70, 15), 100), 0),
department = sample(departments, 200, replace = TRUE)
)
# Derived logic: KPI score = 0.7 * performance + noise drawn from N(mean 20, sd 5), clamped to [0, 100]
df$KPI_score <- pmax(pmin(df$performance_score * 0.7 + rnorm(200, 20, 5), 100), 0)
# --- 3. Categorization Logic ---
df$KPI_tier <- cut(df$KPI_score,
breaks = c(0, 50, 75, 90, 100),
labels = c('Low', 'Medium', 'High', 'Elite'),
include.lowest = TRUE)
# --- 4. Dashboard Visualization ---
ggplot(df, aes(x = KPI_score, y = salary / 1e6, color = department)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = 'lm', se = FALSE, color = 'black', linetype = "dashed", linewidth = 1) +
labs(
title = 'Relationship Between Salary and KPI',
subtitle = 'Visualization of Performance Correlation across 5 Companies',
x = 'KPI Score (0-100)',
y = 'Annual Salary (Millions)',
color = 'Department'
) +
scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "M")) + # Format salary
scale_color_brewer(palette = "Set2") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = 'bold', hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray50"),
legend.position = "bottom",
legend.title = element_text(face = "bold"),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10)
)
5. Results & Analysis
5.1 Performance Correlation
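The scatter plot shows the relationship between compensation and KPI output. The black dashed trendline (a linear model fitted with geom_smooth) gives the overall “pay-for-performance” slope. An upward slope would indicate that higher salaries coincide with higher KPI scores, although a correlation alone does not show that pay causes the improvement. In this simulation salary and performance are drawn independently, so the fitted slope is expected to be close to zero.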
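The trendline can also be quantified rather than read off the plot. The sketch below is a hedged illustration on stand-in data: `salary_m` and `kpi` are hypothetical variables drawn independently of each other, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed for this sketch
n <- 200

# Hypothetical stand-ins: salary (in millions) and KPI drawn independently
salary_m <- rnorm(n, mean = 70, sd = 15)
kpi <- rnorm(n, mean = 70, sd = 10)

# Pearson correlation and the slope of the fitted trendline
r <- cor(kpi, salary_m)
fit <- lm(salary_m ~ kpi)
slope <- coef(fit)[["kpi"]]

# p-value for the null hypothesis that the slope is zero
p_value <- summary(fit)$coefficients["kpi", "Pr(>|t|)"]

print(c(correlation = r, slope = slope, p_value = p_value))
```

For a simple linear regression the slope equals the correlation multiplied by `sd(salary_m) / sd(kpi)`, so a weak correlation always implies a nearly flat trendline.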
5.2 Departmental Distribution
By observing the color-coded points, we can identify which departments (e.g., IT vs Sales) cluster in the “Elite” tier. This visualization allows HR managers to quickly spot which departments are over-performing relative to their average salary.
9 Program 8: Automated Corporate Reporting Engine
8.1 Logic & Automation Workflow
The objective of this module is to eliminate manual reporting by using a Functional Loop to filter data and export individual summaries for each company in the dataset.
Iterative Filtering: The program identifies all unique company_id values. It then loops through this list, creating a subset of data for one company at a time.
Metric Calculation: For every company, the engine calculates summary metrics: workforce size, mean KPI score, and the number of “Elite” performers.
Multi-Format Export: To simulate a real-world workflow, the program is designed to save these summaries as CSV files (for data teams) and print Formatted Summaries (for management).
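The filter-then-export pattern can be sketched on a miniature dataset. This is an illustrative variant, not the practicum's function: it uses base R `split()` instead of a dplyr filter, writes into `tempdir()` so nothing lands in the working directory, and the data frame `toy` and folder name `toy_reports` are hypothetical.

```r
# Hypothetical miniature dataset standing in for the master data frame
toy <- data.frame(
  company_id = c("Company_A", "Company_A", "Company_B"),
  KPI_score  = c(62, 91, 74)
)

# tempdir() keeps this sketch from writing into the working directory
out_dir <- file.path(tempdir(), "toy_reports")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# split() produces one data frame per company; the loop exports each one
by_company <- split(toy, toy$company_id)
for (comp in names(by_company)) {
  path <- file.path(out_dir, paste0(tolower(comp), "_Summary.csv"))
  write.csv(by_company[[comp]], path, row.names = FALSE)
  cat(comp, ": n =", nrow(by_company[[comp]]),
      ", mean KPI =", mean(by_company[[comp]]$KPI_score), "\n")
}

print(list.files(out_dir))
```

Each company ends up with its own CSV file, and the console line gives a quick check that the subset sizes add up to the full dataset.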
8.2 R Implementation (Automation Engine)
# --- 1. Automated Reporting Function ---
generate_company_reports <- function(master_df) {
# Make sure dplyr is available before it is used inside the function
if (!require("dplyr")) install.packages("dplyr", dependencies = TRUE)
library(dplyr)
# Validation: check that the required columns exist
required_cols <- c("company_id", "KPI_score", "KPI_tier")
if (!all(required_cols %in% colnames(master_df))) {
stop("Error: required columns were not found in the data frame!")
}
# Step 1: Create the output folder to keep the files organized
output_dir <- "Company_Reports"
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# Step 2: Get the list of unique companies
unique_companies <- unique(master_df$company_id)
cat("==========================================\n")
cat(" STARTING AUTOMATED REPORT GENERATION\n")
cat("==========================================\n")
# Step 3: Loop over each company
for (comp in unique_companies) {
# Filter the data for this company
comp_data <- master_df %>% filter(company_id == comp)
# Compute company-level metrics
total_emp <- nrow(comp_data)
avg_kpi <- mean(comp_data$KPI_score, na.rm = TRUE)
top_performers <- sum(comp_data$KPI_tier == "Elite", na.rm = TRUE)
# Build the file name (cleaning: lowercase, spaces to underscores)
clean_name <- gsub(" ", "_", tolower(comp))
file_path <- file.path(output_dir, paste0(clean_name, "_Summary.csv"))
# Export the data to CSV
write.csv(comp_data, file_path, row.names = FALSE)
# Print a summary to the console for verification
cat(paste0("[PROCESSING]: ", comp, "\n"))
cat(paste0(" - Workforce Size : ", total_emp, "\n"))
cat(paste0(" - Mean KPI Score : ", round(avg_kpi, 2), "\n"))
cat(paste0(" - Elite Talent : ", top_performers, "\n"))
cat(paste0(" - File Saved to : ", file_path, "\n"))
cat("------------------------------------------\n")
}
cat("SUCCESS: All reports generated in '", output_dir, "' folder.\n")
}
# --- 2. Execution ---
# Run the function on the 'df' data frame created earlier
generate_company_reports(df)
## ==========================================
## STARTING AUTOMATED REPORT GENERATION
## ==========================================
## [PROCESSING]: Company_C
## - Workforce Size : 37
## - Mean KPI Score : 66.92
## - Elite Talent : 1
## - File Saved to : Company_Reports/company_c_Summary.csv
## ------------------------------------------
## [PROCESSING]: Company_B
## - Workforce Size : 40
## - Mean KPI Score : 68.64
## - Elite Talent : 0
## - File Saved to : Company_Reports/company_b_Summary.csv
## ------------------------------------------
## [PROCESSING]: Company_E
## - Workforce Size : 43
## - Mean KPI Score : 67.64
## - Elite Talent : 1
## - File Saved to : Company_Reports/company_e_Summary.csv
## ------------------------------------------
## [PROCESSING]: Company_D
## - Workforce Size : 34
## - Mean KPI Score : 67.03
## - Elite Talent : 0
## - File Saved to : Company_Reports/company_d_Summary.csv
## ------------------------------------------
## [PROCESSING]: Company_A
## - Workforce Size : 46
## - Mean KPI Score : 69.45
## - Elite Talent : 1
## - File Saved to : Company_Reports/company_a_Summary.csv
## ------------------------------------------
## SUCCESS: All reports generated in ' Company_Reports ' folder.
9. Summary of Findings & Conclusion
Through these eight programming implementations, we have transitioned from basic syntax to a full-scale Data Science pipeline.
Key Takeaways:
1. Efficiency: Using loops and functions reduces code redundancy and lets the same workflow scale from 1 company to 1,000 without additional manual effort.
2. Stochastic Power: Monte Carlo simulation shows that random sampling can approximate a fixed constant such as \(\pi\), with accuracy improving as the number of points grows.
3. Visualization: Data is only as useful as its communication; ggplot2 helps make complex transformations (like Z-Scores) intuitive for non-technical stakeholders.
Final Academic Reflection
This practicum confirms that the core of Data Science Programming is not just writing code, but architecting logic that can adapt to changing data environments. By mastering these automated workflows, we are prepared to handle high-velocity data challenges in a professional setting.