Syntax and Control Flow
Practicum ~ Week 4
- Introduction
This report is prepared to fulfill the Advanced Practicum requirements for the Data Science Programming course under the guidance of Bakti Siregar, M.Sc. The primary objective of this practicum is to develop an automated data science workflow by integrating multi-layer functions, nested loops, and complex conditional logic. The tasks within this practicum simulate real-world data science challenges, ranging from dynamic formula computations and Monte Carlo simulations to multi-company KPI analysis. By focusing on advanced statistics, data transformation, and visualization, this report demonstrates the practical application of R in solving sophisticated analytical problems.
1 Dynamic Multi-Formula Function
1.1 Implementation
# =========================
# LIBRARY
# =========================
library(ggplot2)
library(tidyr)
library(plotly)
# =========================
# FUNCTION
# =========================
compute_formula <- function(x, formulas) {
results <- list()
for (f in formulas) {
y <- numeric(length(x))
for (i in seq_along(x)) {
if (f == "linear") {
y[i] <- x[i]
} else if (f == "quadratic") {
y[i] <- x[i]^2
} else if (f == "cubic") {
y[i] <- x[i]^3
} else if (f == "exponential") {
y[i] <- exp(x[i] / 5)
} else {
stop(paste("Invalid formula:", f))
}
}
results[[f]] <- y
}
return(as.data.frame(results))
}
# =========================
# INPUT
# =========================
x_values <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")
# =========================
# RUN FUNCTION
# =========================
df <- compute_formula(x_values, formulas)
df$x <- x_values
# =========================
# TRANSFORM
# =========================
df_long <- pivot_longer(df,
cols = -x,
names_to = "formula",
values_to = "y")
# =========================
# PLOT
# =========================
p <- ggplot(
df_long,
aes(
x = x,
y = y,
color = formula,
text = paste0(
"x: ", x,
"<br>y: ", round(y,2),
"<br>Formula: ", formula
)
)
) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Dynamic Multi-Formula Plot",
subtitle = "Linear, Quadratic, Cubic, Exponential",
x = "X Value",
y = "Y Value"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
legend.position = "top"
)
ggplotly(p, tooltip = "text") %>%
layout(hovermode = "x unified")
1.2 Interpretation
This implementation demonstrates how different mathematical models behave across the same range of input values using nested loops and conditional logic.
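The visualization shows that each formula has a distinct growth pattern. The linear function increases at a constant rate, while the quadratic and cubic functions grow progressively faster as x increases. Over the plotted range x = 1 to 20, the cubic function dominates, reaching 8000 at x = 20 compared with 400 for the quadratic. The exponential function exp(x/5) grows by a constant factor of about 1.22 per unit step, but reaches only about 54.6 at x = 20; it overtakes the polynomials only at much larger x, beyond the plotted range.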
Overall, the comparison clearly illustrates how higher-order and exponential functions can lead to significantly larger outputs, which is important in understanding model selection and data behavior in real-world applications.
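The per-element loop and if/else chain in the implementation could also be written as a lookup table of vectorized functions. The following is a minimal sketch of that alternative, not the implementation used in this report; the names `formula_table` and `compute_formula_vec` are hypothetical and introduced only for illustration.

```r
# Hypothetical alternative to compute_formula(): a named list maps each
# formula name to a vectorized function, so no inner loop over x is needed.
formula_table <- list(
  linear      = function(x) x,
  quadratic   = function(x) x^2,
  cubic       = function(x) x^3,
  exponential = function(x) exp(x / 5)
)

compute_formula_vec <- function(x, formulas) {
  # Validate all requested names up front, mirroring the stop() in the loop version
  unknown <- setdiff(formulas, names(formula_table))
  if (length(unknown) > 0) {
    stop(paste("Invalid formula:", paste(unknown, collapse = ", ")))
  }
  # Apply each function to the whole vector; the result is a named list of columns
  results <- lapply(formula_table[formulas], function(f) f(x))
  as.data.frame(results)
}

df_vec <- compute_formula_vec(1:20, c("linear", "quadratic", "cubic", "exponential"))
```

Adding a new formula then only requires a new entry in `formula_table`. An unknown name, such as `compute_formula_vec(1:3, "log")`, stops with an error, as in the loop version.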
2 Nested Simulation – Multi-Sales & Discounts
2.1 Implementation
library(dplyr)
library(plotly)
library(knitr)
# Simulation function (unchanged)
simulate_sales <- function(n_salesperson, days) {
all_data <- data.frame()
for (sp in 1:n_salesperson) {
cumulative_sales <- 0
for (d in 1:days) {
sales_amount <- sample(100:1000, 1)
if (sales_amount > 800) {
discount_rate <- 0.20
} else if (sales_amount > 500) {
discount_rate <- 0.10
} else {
discount_rate <- 0.05
}
cumulative_sales <- cumulative_sales + sales_amount
temp <- data.frame(
salesperson = paste0("SP", sp),
day = d,
sales_amount = sales_amount,
discount_rate = discount_rate,
cumulative_sales = cumulative_sales
)
all_data <- rbind(all_data, temp)
}
}
return(all_data)
}
# Run the simulation
set.seed(123)
data_sales <- simulate_sales(3, 10)
cat(" Table 1: Sales Data\n")
## Table 1: Sales Data
| salesperson | day | sales_amount | discount_rate | cumulative_sales |
|---|---|---|---|---|
| SP1 | 1 | 514 | 0.10 | 514 |
| SP1 | 2 | 562 | 0.10 | 1076 |
| SP1 | 3 | 278 | 0.05 | 1354 |
| SP1 | 4 | 625 | 0.10 | 1979 |
| SP1 | 5 | 294 | 0.05 | 2273 |
| SP1 | 6 | 917 | 0.20 | 3190 |
| SP1 | 7 | 217 | 0.05 | 3407 |
| SP1 | 8 | 398 | 0.05 | 3805 |
| SP1 | 9 | 328 | 0.05 | 4133 |
| SP1 | 10 | 343 | 0.05 | 4476 |
| SP2 | 1 | 113 | 0.05 | 113 |
| SP2 | 2 | 473 | 0.05 | 586 |
| SP2 | 3 | 764 | 0.10 | 1350 |
| SP2 | 4 | 701 | 0.10 | 2051 |
| SP2 | 5 | 702 | 0.10 | 2753 |
| SP2 | 6 | 867 | 0.20 | 3620 |
| SP2 | 7 | 808 | 0.20 | 4428 |
| SP2 | 8 | 190 | 0.05 | 4618 |
| SP2 | 9 | 447 | 0.05 | 5065 |
| SP2 | 10 | 748 | 0.10 | 5813 |
| SP3 | 1 | 454 | 0.05 | 454 |
| SP3 | 2 | 939 | 0.20 | 1393 |
| SP3 | 3 | 125 | 0.05 | 1518 |
| SP3 | 4 | 618 | 0.10 | 2136 |
| SP3 | 5 | 525 | 0.10 | 2661 |
| SP3 | 6 | 748 | 0.10 | 3409 |
| SP3 | 7 | 865 | 0.20 | 4274 |
| SP3 | 8 | 310 | 0.05 | 4584 |
| SP3 | 9 | 689 | 0.10 | 5273 |
| SP3 | 10 | 692 | 0.10 | 5965 |
summary_stats <- data_sales %>%
group_by(salesperson) %>%
summarise(
total_sales = sum(sales_amount),
mean_sales = mean(sales_amount)
)
cat(" Table 2: Summary Statistics\n")
## Table 2: Summary Statistics
| salesperson | total_sales | mean_sales |
|---|---|---|
| SP1 | 4476 | 447.6 |
| SP2 | 5813 | 581.3 |
| SP3 | 5965 | 596.5 |
# =========================
# 📈 PLOTLY VISUALIZATION
# =========================
fig <- plot_ly(data_sales,
x = ~day,
y = ~cumulative_sales,
color = ~salesperson,
type = 'scatter',
mode = 'lines+markers')
fig <- fig %>%
layout(title = "Cumulative Sales per Salesperson",
xaxis = list(title = "Day"),
yaxis = list(title = "Cumulative Sales"))
fig
2.2 Interpretation
This implementation simulates daily sales activity for multiple salespersons over a given period using a structured approach with functions, loops, and conditionals.
The simulate_sales function generates a random sales amount between 100 and 1000 for each salesperson on each day. Conditional logic assigns a discount rate based on the sales value (20% above 800, 10% above 500, and 5% otherwise), reflecting real-world business rules. The nested loops iterate systematically over each salesperson and each day.
Cumulative sales are calculated progressively, enabling tracking of overall performance over time. The resulting dataset is then summarized to show total and average sales per salesperson, providing a clear comparison of performance.
Finally, the interactive Plotly visualization helps illustrate how cumulative sales grow over time for each salesperson, making it easier to identify trends and differences in sales performance.
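The discount rule and the running total described above can also be computed without explicit loops. The sketch below is an illustrative alternative under stated assumptions, not the code used in this report: `assign_discount` is a hypothetical helper, and the six sales amounts are copied from the first three days of SP1 and SP2 in Table 1.

```r
# Hypothetical vectorized helper: map sales amounts to the same three
# discount tiers used in simulate_sales() (> 800, > 500, otherwise).
assign_discount <- function(sales_amount) {
  # With the default right = TRUE the intervals are
  # (-Inf, 500], (500, 800], (800, Inf)
  tier <- cut(sales_amount, breaks = c(-Inf, 500, 800, Inf), labels = FALSE)
  c(0.05, 0.10, 0.20)[tier]
}

# Small example using the first three days of SP1 and SP2 from Table 1
sales <- data.frame(
  salesperson  = rep(c("SP1", "SP2"), each = 3),
  day          = rep(1:3, times = 2),
  sales_amount = c(514, 562, 278, 113, 473, 764)
)
sales$discount_rate <- assign_discount(sales$sales_amount)

# ave() applies cumsum() within each salesperson, giving the running total
sales$cumulative_sales <- ave(sales$sales_amount, sales$salesperson, FUN = cumsum)
```

Because the intervals are closed on the right, amounts of exactly 500 and 800 fall into the lower tier, which matches the strict `>` comparisons in the loop version.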
3 Multi-Level Performance Categorization
3.1 Implementation
library(dplyr)
library(plotly)
library(knitr)
# =========================
# DATA SIMULATION
# =========================
set.seed(123)
sales_amount <- sample(100:1000, 30)
# =========================
# FUNCTION: Categorize Performance
# =========================
categorize_performance <- function(sales_amount) {
categories <- c()
for (s in sales_amount) {
if (s >= 900) {
categories <- c(categories, "Excellent")
} else if (s >= 700) {
categories <- c(categories, "Very Good")
} else if (s >= 500) {
categories <- c(categories, "Good")
} else if (s >= 300) {
categories <- c(categories, "Average")
} else {
categories <- c(categories, "Poor")
}
}
return(categories)
}
# =========================
# APPLY FUNCTION
# =========================
performance <- categorize_performance(sales_amount)
data_perf <- data.frame(
sales_amount = sales_amount,
category = performance
)
# Add an ID column to keep the table tidy
data_perf <- data_perf %>%
mutate(id = row_number()) %>%
select(id, everything())
# =========================
# 📋 TABLE: IMPLEMENTATION RESULT
# =========================
cat("### Table 1: Sales Performance Categorization\n")
## ### Table 1: Sales Performance Categorization
| id | sales_amount | category |
|---|---|---|
| 1 | 514 | Good |
| 2 | 562 | Good |
| 3 | 278 | Poor |
| 4 | 625 | Good |
| 5 | 294 | Poor |
| 6 | 917 | Excellent |
| 7 | 217 | Poor |
| 8 | 398 | Average |
| 9 | 328 | Average |
| 10 | 343 | Average |
| 11 | 113 | Poor |
| 12 | 473 | Average |
| 13 | 764 | Very Good |
| 14 | 701 | Very Good |
| 15 | 702 | Very Good |
| 16 | 867 | Very Good |
| 17 | 808 | Very Good |
| 18 | 190 | Poor |
| 19 | 447 | Average |
| 20 | 748 | Very Good |
| 21 | 454 | Average |
| 22 | 939 | Excellent |
| 23 | 125 | Poor |
| 24 | 618 | Good |
| 25 | 525 | Good |
| 26 | 981 | Excellent |
| 27 | 865 | Very Good |
| 28 | 310 | Average |
| 29 | 689 | Good |
| 30 | 692 | Good |
# =========================
# 📊 SUMMARY TABLE
# =========================
summary_perf <- data_perf %>%
group_by(category) %>%
summarise(count = n()) %>%
mutate(percentage = round((count / sum(count)) * 100, 2))
cat("\n### Table 2: Summary Statistics\n")
##
## ### Table 2: Summary Statistics
| category | count | percentage |
|---|---|---|
| Average | 7 | 23.33 |
| Excellent | 3 | 10.00 |
| Good | 7 | 23.33 |
| Poor | 6 | 20.00 |
| Very Good | 7 | 23.33 |
# =========================
# 📈 BAR PLOT (Plotly)
# =========================
bar_plot <- plot_ly(summary_perf,
x = ~category,
y = ~count,
type = "bar")
bar_plot <- bar_plot %>%
layout(title = "Performance Distribution (Bar Plot)",
xaxis = list(title = "Category"),
yaxis = list(title = "Count"))
bar_plot
3.2 Interpretation
This implementation categorizes sales performance into five levels: Excellent, Very Good, Good, Average, and Poor based on sales amount thresholds.
A loop is used to assign each sales value into a category using conditional logic, simulating a real-world evaluation system. The results show the distribution of performance levels, both in counts and percentages.
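The bar plot highlights the number of occurrences in each category, while the pie chart provides a clear view of their proportional distribution. In this sample no single level dominates: Average, Good, and Very Good each account for 7 of 30 observations (23.33%), followed by Poor with 6 (20%) and Excellent with 3 (10%). This distribution supports decision-making in evaluating overall sales performance.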
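The threshold-based categorization can also be expressed with base R's `cut()`, which bins a numeric vector in one call. This is a minimal sketch of that alternative rather than the report's implementation; `categorize_performance_vec` is a hypothetical name, and the example values mix boundary cases with a few amounts from Table 1.

```r
# Hypothetical vectorized counterpart of categorize_performance():
# right = FALSE makes each interval closed on the left, matching the >= tests.
categorize_performance_vec <- function(sales_amount) {
  cut(
    sales_amount,
    breaks = c(-Inf, 300, 500, 700, 900, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE
  )
}

# Boundary values plus a few amounts from Table 1
example_sales <- c(299, 300, 514, 700, 867, 900, 917)
example_cat <- categorize_performance_vec(example_sales)

# table() on the factor keeps the Poor-to-Excellent level order
# instead of sorting the categories alphabetically
category_counts <- table(example_cat)
```

One design consequence is that the result is a factor with levels in performance order, so summaries and bar plots built from it would list the categories from Poor to Excellent rather than alphabetically as in Table 2.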
4 Multi-Company Dataset Simulation
4.1 Implementation
library(dplyr)
library(plotly)
library(knitr)
# =========================
# FUNCTION: Generate Company Data
# =========================
generate_company_data <- function(n_company, n_employees) {
all_data <- data.frame()
departments <- c("HR", "Finance", "IT", "Marketing")
for (c in 1:n_company) {
for (e in 1:n_employees) {
salary <- sample(3000:10000, 1)
performance_score <- sample(60:100, 1)
KPI_score <- sample(70:100, 1)
department <- sample(departments, 1)
# Conditional: Top Performer
if (KPI_score > 90) {
category <- "Top Performer"
} else {
category <- "Regular"
}
temp <- data.frame(
company_id = paste0("C", c),
employee_id = paste0("E", e),
department = department,
salary = salary,
performance_score = performance_score,
KPI_score = KPI_score,
category = category
)
all_data <- rbind(all_data, temp)
}
}
return(all_data)
}
# =========================
# GENERATE DATA
# =========================
set.seed(123)
company_data <- generate_company_data(3, 10)
# =========================
# 📋 TABLE 1: FULL DATA
# =========================
cat("### Table 1: Company Employee Data\n")
## ### Table 1: Company Employee Data
| company_id | employee_id | department | salary | performance_score | KPI_score | category |
|---|---|---|---|---|---|---|
| C1 | E1 | Finance | 5462 | 74 | 88 | Regular |
| C1 | E2 | Finance | 7290 | 96 | 89 | Regular |
| C1 | E3 | IT | 6445 | 84 | 95 | Top Performer |
| C1 | E4 | Marketing | 5756 | 86 | 94 | Top Performer |
| C1 | E5 | IT | 4016 | 68 | 98 | Top Performer |
| C1 | E6 | Finance | 5887 | 85 | 76 | Regular |
| C1 | E7 | Finance | 8768 | 78 | 73 | Regular |
| C1 | E8 | Marketing | 9736 | 98 | 90 | Regular |
| C1 | E9 | HR | 4166 | 91 | 79 | Regular |
| C1 | E10 | Finance | 4798 | 68 | 78 | Regular |
| C2 | E1 | HR | 4046 | 86 | 97 | Top Performer |
| C2 | E2 | HR | 6206 | 86 | 75 | Regular |
| C2 | E3 | Marketing | 4313 | 88 | 74 | Regular |
| C2 | E4 | HR | 3587 | 72 | 87 | Regular |
| C2 | E5 | Finance | 7088 | 86 | 94 | Top Performer |
| C2 | E6 | IT | 3276 | 74 | 78 | Regular |
| C2 | E7 | Marketing | 9233 | 90 | 85 | Regular |
| C2 | E8 | Finance | 5821 | 67 | 91 | Top Performer |
| C2 | E9 | Finance | 4182 | 76 | 91 | Top Performer |
| C2 | E10 | HR | 9128 | 93 | 73 | Regular |
| C3 | E1 | Finance | 5116 | 84 | 89 | Regular |
| C3 | E2 | HR | 8208 | 91 | 83 | Regular |
| C3 | E3 | Finance | 5338 | 99 | 85 | Regular |
| C3 | E4 | Finance | 6979 | 90 | 94 | Top Performer |
| C3 | E5 | HR | 6229 | 94 | 83 | Regular |
| C3 | E6 | IT | 7575 | 66 | 72 | Regular |
| C3 | E7 | HR | 4913 | 74 | 90 | Regular |
| C3 | E8 | Finance | 4074 | 69 | 87 | Regular |
| C3 | E9 | Finance | 5283 | 93 | 79 | Regular |
| C3 | E10 | Finance | 7222 | 71 | 89 | Regular |
# =========================
# 📊 SUMMARY PER COMPANY
# =========================
summary_company <- company_data %>%
group_by(company_id) %>%
summarise(
avg_salary = mean(salary),
avg_performance = mean(performance_score),
max_KPI = max(KPI_score)
)
cat("\n### Table 2: Company Summary\n")
##
## ### Table 2: Company Summary
| company_id | avg_salary | avg_performance | max_KPI |
|---|---|---|---|
| C1 | 6232.4 | 82.8 | 98 |
| C2 | 5688.0 | 81.8 | 97 |
| C3 | 6093.7 | 83.1 | 94 |
# =========================
# 📈 PLOT 1: AVG SALARY
# =========================
plot_salary <- plot_ly(summary_company,
x = ~company_id,
y = ~avg_salary,
type = "bar")
plot_salary <- plot_salary %>%
layout(title = "Average Salary per Company",
xaxis = list(title = "Company"),
yaxis = list(title = "Average Salary"))
plot_salary
# =========================
# 📈 PLOT 2: AVG PERFORMANCE
# =========================
plot_perf <- plot_ly(summary_company,
x = ~company_id,
y = ~avg_performance,
type = "bar")
plot_perf <- plot_perf %>%
layout(title = "Average Performance per Company",
xaxis = list(title = "Company"),
yaxis = list(title = "Performance Score"))
plot_perf
# =========================
# 🥧 PIE CHART: CATEGORY DISTRIBUTION
# =========================
category_dist <- company_data %>%
group_by(category) %>%
summarise(count = n()) %>%
mutate(percentage = round(count/sum(count)*100,2))
pie_chart <- plot_ly(category_dist,
labels = ~category,
values = ~percentage,
type = "pie")
pie_chart <- pie_chart %>%
layout(title = "Employee Category Distribution")
pie_chart
4.2 Interpretation
This implementation simulates employee data across multiple companies using nested loops to represent companies and their employees. Each employee is assigned attributes such as salary, department, performance score, and KPI score.
A conditional rule is applied to classify employees as “Top Performer” when their KPI score exceeds 90, reflecting performance evaluation in real-world organizations.
The summary table provides key insights per company, including average salary, average performance, and maximum KPI score. The visualizations help compare company performance and highlight the distribution of top-performing employees.
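The classification rule and a per-company count of top performers can be checked on the KPI scores listed in Table 1. The sketch below is illustrative only: it copies the 30 KPI scores from that table into a hypothetical data frame `kpi` and uses base R's `ifelse()` and `tapply()` in place of the loop and the dplyr summary.

```r
# KPI scores copied from Table 1 (10 employees per company)
kpi <- data.frame(
  company_id = rep(c("C1", "C2", "C3"), each = 10),
  KPI_score  = c(88, 89, 95, 94, 98, 76, 73, 90, 79, 78,
                 97, 75, 74, 87, 94, 78, 85, 91, 91, 73,
                 89, 83, 85, 94, 83, 72, 90, 87, 79, 89)
)

# Vectorized form of the if/else rule: strictly greater than 90
kpi$category <- ifelse(kpi$KPI_score > 90, "Top Performer", "Regular")

# Count and share of top performers per company
top_count <- tapply(kpi$category == "Top Performer", kpi$company_id, sum)
top_share <- tapply(kpi$category == "Top Performer", kpi$company_id, mean)
```

On these scores the counts are 3, 4, and 1 for C1, C2, and C3, that is, 8 of 30 employees overall. Because the comparison is strict, a score of exactly 90 (C1 E8 and C3 E7 in Table 1) is classified as Regular.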
5 Monte Carlo Simulation (Pi & Probability)
5.1 Implementation
library(plotly)
library(dplyr)
library(knitr)
# =========================
# FUNCTION: Monte Carlo Pi
# =========================
monte_carlo_pi <- function(n_points) {
x_vals <- c()
y_vals <- c()
inside <- c()
count_inside <- 0
count_square <- 0
for (i in 1:n_points) {
# Generate random point
x <- runif(1, -1, 1)
y <- runif(1, -1, 1)
x_vals <- c(x_vals, x)
y_vals <- c(y_vals, y)
# Check inside circle
if (x^2 + y^2 <= 1) {
inside <- c(inside, "Inside Circle")
count_inside <- count_inside + 1
} else {
inside <- c(inside, "Outside Circle")
}
# Probability: sub-square (-0.5 to 0.5)
if (x >= -0.5 && x <= 0.5 && y >= -0.5 && y <= 0.5) {
count_square <- count_square + 1
}
}
# Estimate Pi
pi_estimate <- 4 * (count_inside / n_points)
# Probability result
prob_square <- count_square / n_points
# Data frame
data <- data.frame(
x = x_vals,
y = y_vals,
status = inside
)
return(list(
data = data,
pi_estimate = pi_estimate,
prob_square = prob_square
))
}
set.seed(123)
result <- monte_carlo_pi(1000)
data_mc <- result$data
# =========================
# 📋 TABLE RESULT
# =========================
cat(" Table: Monte Carlo Sample Points\n")
## Table: Monte Carlo Sample Points
| x | y | status |
|---|---|---|
| -0.4248450 | 0.5766103 | Inside Circle |
| -0.1820462 | 0.7660348 | Inside Circle |
| 0.8809346 | -0.9088870 | Outside Circle |
| 0.0562110 | 0.7848381 | Inside Circle |
| 0.1028700 | -0.0867705 | Inside Circle |
| 0.9136667 | -0.0933317 | Inside Circle |
| 0.3551413 | 0.1452668 | Inside Circle |
| -0.7941506 | 0.7996499 | Outside Circle |
| -0.5078245 | -0.9158809 | Outside Circle |
| -0.3441586 | 0.9090073 | Inside Circle |
| 0.7790786 | 0.3856068 | Inside Circle |
| 0.2810136 | 0.9885396 | Outside Circle |
| 0.3114116 | 0.4170609 | Inside Circle |
| 0.0881320 | 0.1882840 | Inside Circle |
| -0.4216805 | -0.7057727 | Inside Circle |
| 0.9260485 | 0.8045981 | Outside Circle |
| 0.3814106 | 0.5909348 | Inside Circle |
| -0.9507726 | -0.0444081 | Inside Circle |
| 0.5169191 | -0.5671841 | Inside Circle |
| -0.3636380 | -0.5367484 | Inside Circle |
## Estimated Pi: 3.16
## Probability (point in sub-square): 0.252
summary_mc <- data_mc %>%
group_by(status) %>%
summarise(count = n()) %>%
mutate(percentage = round(count/sum(count)*100,2))
kable(summary_mc)
| status | count | percentage |
|---|---|---|
| Inside Circle | 790 | 79 |
| Outside Circle | 210 | 21 |
5.2 Interpretation
This simulation uses the Monte Carlo method to estimate π by generating random points uniformly in the square [-1, 1] × [-1, 1] and counting how many fall inside the unit circle. Because the circle has area π and the square has area 4, the fraction of points inside the circle approximates π/4, so the estimate is four times that fraction. With 1,000 points, 790 fall inside the circle, giving 4 × 0.79 = 3.16.
Additionally, the simulation computes the probability of a point falling within the smaller sub-square [-0.5, 0.5] × [-0.5, 0.5], demonstrating probability estimation through random sampling. The sub-square has area 1, so the theoretical probability is 1/4 = 0.25, close to the simulated value of 0.252.
The scatter plot visualizes the distribution of points, clearly distinguishing those inside and outside the circle. As the number of points increases, the estimate of π tends to become more accurate, reflecting the law of large numbers.
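The effect of sample size on the estimate can be illustrated with a vectorized version of the simulation. This is a sketch under stated assumptions, not the report's implementation: `estimate_pi` is a hypothetical function, the sample sizes are chosen for illustration, and because it draws all points at once it does not reproduce the same random points as the loop version.

```r
# Hypothetical vectorized estimator: draw all points at once instead of
# growing vectors inside a loop.
estimate_pi <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  p_inside <- mean(x^2 + y^2 <= 1)   # fraction inside the circle, estimates pi / 4
  list(
    pi_estimate = 4 * p_inside,
    # Binomial standard error of the proportion, scaled by 4
    std_error   = 4 * sqrt(p_inside * (1 - p_inside) / n_points)
  )
}

set.seed(123)
sample_sizes <- c(100, 1000, 10000, 100000)
estimates <- sapply(sample_sizes, function(n) estimate_pi(n)$pi_estimate)

# Absolute error of each estimate against R's built-in constant pi
abs_error <- abs(estimates - pi)
```

The standard error shrinks in proportion to 1/sqrt(n): it is roughly 0.05 for 1,000 points, so an estimate of 3.16 at that sample size is well within ordinary sampling variation around π.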
6 Advanced Data Transformation & Feature Engineering
6.1 Implementation
library(dplyr)
library(plotly)
library(knitr)
# =========================
# SAMPLE DATA
# =========================
set.seed(123)
df <- data.frame(
salary = sample(3000:10000, 30),
performance_score = sample(60:100, 30)
)
# =========================
# FUNCTION: NORMALIZATION (Min-Max)
# =========================
normalize_columns <- function(df) {
df_norm <- df
for (col in names(df)) {
if (is.numeric(df[[col]])) {
min_val <- min(df[[col]])
max_val <- max(df[[col]])
df_norm[[col]] <- (df[[col]] - min_val) / (max_val - min_val)
}
}
return(df_norm)
}
# =========================
# FUNCTION: Z-SCORE
# =========================
z_score <- function(df) {
df_z <- df
for (col in names(df)) {
if (is.numeric(df[[col]])) {
mean_val <- mean(df[[col]])
sd_val <- sd(df[[col]])
df_z[[col]] <- (df[[col]] - mean_val) / sd_val
}
}
return(df_z)
}
# =========================
# APPLY TRANSFORMATION
# =========================
df_norm <- normalize_columns(df)
df_z <- z_score(df)
# =========================
# FEATURE ENGINEERING
# =========================
df_feat <- df %>%
mutate(
performance_category = case_when(
performance_score >= 90 ~ "Excellent",
performance_score >= 80 ~ "Very Good",
performance_score >= 70 ~ "Good",
performance_score >= 65 ~ "Average",
TRUE ~ "Poor"
),
salary_bracket = case_when(
salary >= 8000 ~ "High",
salary >= 5000 ~ "Medium",
TRUE ~ "Low"
)
)
# =========================
# 📋 TABLE
# =========================
cat("### Table: Original Data with New Features\n")
## ### Table: Original Data with New Features
| salary | performance_score | performance_category | salary_bracket |
|---|---|---|---|
| 5462 | 78 | Good | Medium |
| 5510 | 95 | Excellent | Medium |
| 5226 | 73 | Good | Medium |
| 3525 | 76 | Good | Low |
| 7290 | 71 | Good | Medium |
| 5985 | 74 | Good | Medium |
| 4841 | 91 | Excellent | Low |
| 4141 | 66 | Average | Low |
| 6370 | 68 | Average | Medium |
| 8348 | 92 | Excellent | High |
| 8363 | 69 | Average | High |
| 8133 | 82 | Very Good | High |
| 6445 | 86 | Very Good | Medium |
| 7760 | 87 | Very Good | Medium |
| 9745 | 80 | Very Good | High |
| 4626 | 93 | Excellent | Low |
| 5756 | 88 | Very Good | Medium |
| 8106 | 65 | Average | High |
| 8210 | 61 | Poor | High |
| 3952 | 64 | Poor | Low |
| 7443 | 67 | Average | Medium |
| 4016 | 96 | Excellent | Low |
| 5012 | 72 | Good | Medium |
| 8474 | 77 | Good | High |
| 5887 | 60 | Poor | Medium |
| 9169 | 94 | Excellent | High |
| 5566 | 70 | Good | Medium |
| 4449 | 75 | Good | Low |
| 8768 | 83 | Very Good | High |
| 4789 | 81 | Very Good | Low |
# =========================
# 📊 COMPARISON DATA
# =========================
compare_df <- data.frame(
original_salary = df$salary,
normalized_salary = df_norm$salary,
zscore_salary = df_z$salary
)
# =========================
# 📈 HISTOGRAM (Plotly)
# =========================
hist_plot <- plot_ly(compare_df, x = ~original_salary, type = "histogram", name = "Original") %>%
add_trace(x = ~normalized_salary, name = "Normalized") %>%
add_trace(x = ~zscore_salary, name = "Z-Score") %>%
layout(title = "Salary Distribution Comparison")
hist_plot
# =========================
# BOXPLOT
# =========================
library(tidyr)
compare_long <- compare_df %>%
pivot_longer(cols = everything(),
names_to = "type",
values_to = "value")
box_plot <- plot_ly(compare_long,
x = ~type,
y = ~value,
type = "box")
box_plot <- box_plot %>%
layout(title = "Boxplot Comparison (Original vs Normalized vs Z-Score)",
xaxis = list(title = "Data Type"),
yaxis = list(title = "Value"))
box_plot
6.2 Interpretation
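This implementation applies two data transformation techniques, min-max normalization and z-score standardization, using loop-based functions. Min-max normalization maps each numeric column onto the interval [0, 1], while z-score standardization centers each column at mean 0 with standard deviation 1. Both rescale the data to make features comparable and suitable for analysis.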
Additionally, new features are created to categorize performance and salary levels, enhancing the dataset with meaningful groupings. This reflects real-world feature engineering practices used in data science.
The visualizations compare the distribution of original and transformed data. Histograms show how the scale changes, while boxplots highlight differences in spread and outliers. Overall, the transformations improve data interpretability and prepare it for further analysis or modeling.
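The properties of the two transformations can be verified on a small example. The sketch below is illustrative: it uses the first five salaries from the table in Section 6.1, applies both formulas directly, and compares the z-scores with base R's `scale()`; the variable names are hypothetical.

```r
# Salaries taken from the first five rows of the table in Section 6.1
salary <- c(5462, 5510, 5226, 3525, 7290)

# Min-max normalization: linear map onto [0, 1]
salary_norm <- (salary - min(salary)) / (max(salary) - min(salary))

# Z-score standardization: subtract the mean, divide by the sample sd
salary_z <- (salary - mean(salary)) / sd(salary)

# Base R's scale() performs the same centering and scaling
salary_scaled <- as.numeric(scale(salary))

# Both transforms are linear, so the ordering of observations is unchanged
same_order <- identical(order(salary), order(salary_norm)) &&
  identical(order(salary), order(salary_z))
```

Since both transformations are linear, they change the location and scale of the data but not the shape of its distribution or the ranking of observations.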
7 Mini Project – Company KPI Dashboard & Simulation
7.1 Implementation
library(dplyr)
library(plotly)
library(knitr)
# =========================
# FUNCTION: GENERATE DATA
# =========================
generate_kpi_data <- function(n_company = 5, min_emp = 50, max_emp = 100) {
all_data <- data.frame()
departments <- c("HR", "Finance", "IT", "Marketing", "Operations")
for (c in 1:n_company) {
n_employees <- sample(min_emp:max_emp, 1)
for (e in 1:n_employees) {
salary <- sample(3000:12000, 1)
performance_score <- sample(60:100, 1)
KPI_score <- sample(70:100, 1)
department <- sample(departments, 1)
temp <- data.frame(
employee_id = paste0("E", c, "_", e),
company_id = paste0("C", c),
salary = salary,
performance_score = performance_score,
KPI_score = KPI_score,
department = department
)
all_data <- rbind(all_data, temp)
}
}
return(all_data)
}
# =========================
# GENERATE DATA
# =========================
set.seed(123)
df <- generate_kpi_data(5, 50, 100)
# =========================
# KPI TIER (LOOP)
# =========================
kpi_tier <- c()
for (k in df$KPI_score) {
if (k >= 90) {
kpi_tier <- c(kpi_tier, "Top Performer")
} else if (k >= 80) {
kpi_tier <- c(kpi_tier, "High")
} else if (k >= 70) {
kpi_tier <- c(kpi_tier, "Medium")
} else {
kpi_tier <- c(kpi_tier, "Low")
}
}
df$kpi_tier <- kpi_tier
# =========================
# 📋 TABLE: SAMPLE DATA
# =========================
cat(" Table 1: Sample Employee Data\n")
## Table 1: Sample Employee Data
| employee_id | company_id | salary | performance_score | KPI_score | department | kpi_tier |
|---|---|---|---|---|---|---|
| E1_1 | C1 | 5510 | 73 | 72 | Finance | Medium |
| E1_2 | C1 | 4841 | 96 | 89 | HR | High |
| E1_3 | C1 | 9745 | 86 | 74 | IT | Medium |
| E1_4 | C1 | 5887 | 85 | 76 | Finance | Medium |
| E1_5 | C1 | 5979 | 73 | 86 | IT | High |
| E1_6 | C1 | 7468 | 71 | 84 | Finance | High |
| E1_7 | C1 | 10788 | 66 | 78 | HR | Medium |
| E1_8 | C1 | 4046 | 86 | 97 | Operations | Top Performer |
| E1_9 | C1 | 6206 | 86 | 75 | HR | Medium |
| E1_10 | C1 | 11156 | 64 | 77 | Marketing | Medium |
| E1_11 | C1 | 4598 | 72 | 87 | HR | High |
| E1_12 | C1 | 7088 | 86 | 94 | Operations | Top Performer |
| E1_13 | C1 | 3040 | 85 | 97 | Marketing | Top Performer |
| E1_14 | C1 | 5503 | 81 | 91 | HR | Top Performer |
| E1_15 | C1 | 11565 | 93 | 73 | Operations | Medium |
| E1_16 | C1 | 5116 | 84 | 89 | HR | High |
| E1_17 | C1 | 10126 | 94 | 77 | Marketing | Medium |
| E1_18 | C1 | 6229 | 94 | 83 | Operations | High |
| E1_19 | C1 | 7575 | 66 | 72 | Finance | Medium |
| E1_20 | C1 | 8966 | 80 | 74 | IT | Medium |
# =========================
# 📊 SUMMARY PER COMPANY
# =========================
summary_company <- df %>%
group_by(company_id) %>%
summarise(
avg_salary = mean(salary),
avg_KPI = mean(KPI_score),
top_performers = sum(kpi_tier == "Top Performer")
)
cat(" Table 2: Company Summary\n")
## Table 2: Company Summary
| company_id | avg_salary | avg_KPI | top_performers |
|---|---|---|---|
| C1 | 7480.175 | 83.65000 | 19 |
| C2 | 7101.016 | 84.63934 | 22 |
| C3 | 7890.163 | 84.62245 | 34 |
| C4 | 7193.750 | 86.32292 | 38 |
| C5 | 7903.556 | 86.95556 | 40 |
# =========================
# 📊 DEPARTMENT ANALYSIS
# =========================
dept_analysis <- df %>%
group_by(company_id, department) %>%
summarise(count = n(), .groups = "drop")
cat("Table 3: Department Distribution\n")
## Table 3: Department Distribution
| company_id | department | count |
|---|---|---|
| C1 | Finance | 18 |
| C1 | HR | 14 |
| C1 | IT | 11 |
| C1 | Marketing | 16 |
| C1 | Operations | 21 |
| C2 | Finance | 11 |
| C2 | HR | 9 |
| C2 | IT | 14 |
| C2 | Marketing | 19 |
| C2 | Operations | 8 |
| C3 | Finance | 19 |
| C3 | HR | 19 |
| C3 | IT | 17 |
| C3 | Marketing | 24 |
| C3 | Operations | 19 |
| C4 | Finance | 13 |
| C4 | HR | 26 |
| C4 | IT | 21 |
| C4 | Marketing | 19 |
| C4 | Operations | 17 |
| C5 | Finance | 20 |
| C5 | HR | 19 |
| C5 | IT | 21 |
| C5 | Marketing | 14 |
| C5 | Operations | 16 |
# =========================
# 📈 GROUPED BAR (DEPARTMENT)
# =========================
bar_dept <- plot_ly(dept_analysis,
x = ~department,
y = ~count,
color = ~company_id,
type = "bar")
bar_dept <- bar_dept %>%
layout(title = "Department Distribution per Company",
barmode = "group")
bar_dept
# =========================
# 📈 SCATTER: SALARY VS KPI
# =========================
scatter <- plot_ly(df,
x = ~salary,
y = ~KPI_score,
color = ~company_id,
type = "scatter",
mode = "markers")
scatter <- scatter %>%
layout(title = "Salary vs KPI Score")
scatter
7.2 Interpretation
This mini project simulates a company KPI dashboard by generating employee data across multiple companies. Each employee is assigned attributes such as salary, performance score, KPI score, and department.
A loop-based categorization is used to classify employees into KPI tiers, highlighting top performers and performance distribution. The summary table provides key metrics per company, including average salary, average KPI, and the number of top performers.
The visualizations offer deeper insights:
- Grouped bar charts show department distribution across companies.
- The scatter plot of salary against KPI score, colored by company, shows how the two variables relate.
- Histograms illustrate salary distribution patterns.
Overall, this simulation reflects real-world data analysis workflows, combining data generation, transformation, and visualization into a comprehensive KPI dashboard.
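The relationship between salary and KPI score can also be quantified with a correlation coefficient and a simple linear regression. The following is a minimal sketch on hypothetical data drawn the same way as in `generate_kpi_data()`; the sample size of 400 and the names `kpi_df`, `salary_kpi_cor`, `fit`, and `slope` are assumptions made for illustration.

```r
# Hypothetical data drawn the same way as generate_kpi_data(): salary and
# KPI score come from independent sample() calls.
set.seed(123)
n <- 400
kpi_df <- data.frame(
  salary    = sample(3000:12000, n, replace = TRUE),
  KPI_score = sample(70:100, n, replace = TRUE)
)

# Pearson correlation and a simple linear regression of KPI on salary
salary_kpi_cor <- cor(kpi_df$salary, kpi_df$KPI_score)
fit <- lm(KPI_score ~ salary, data = kpi_df)
slope <- unname(coef(fit)["salary"])

# Fitted values could be drawn as a trend line over the scatter plot
kpi_df$fitted_kpi <- fitted(fit)
```

Because the two variables are simulated independently, the correlation and slope are expected to be close to zero, so any apparent trend in the scatter plot reflects sampling noise rather than a real effect.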
8 Automated Report Generation (Bonus)
library(ggplot2)
library(dplyr)
library(knitr)
# =========================
# SAMPLE DATA (if not already available)
# =========================
set.seed(123)
df_company <- data.frame(
company_id = sample(1:3, 150, replace = TRUE),
salary = runif(150, 3000, 10000),
KPI_score = runif(150, 50, 100),
performance_score = runif(150, 50, 100),
department = sample(c("IT","HR","Finance","Marketing"), 150, replace = TRUE)
)
# =========================
# KPI TIER
# =========================
df_company$kpi_tier <- ifelse(df_company$KPI_score >= 90, "Top Performer",
ifelse(df_company$KPI_score >= 80, "High",
ifelse(df_company$KPI_score >= 70, "Medium", "Low")))
# =========================
# FUNCTION: AUTOMATED REPORT
# =========================
generate_report <- function(data) {
for(c in unique(data$company_id)){
cat("\n====================================\n")
cat("Company ID:", c, "\n")
cat("====================================\n")
data_subset <- data %>% filter(company_id == c)
# =========================
# TABLE 1: SUMMARY
# =========================
summary_table <- data_subset %>%
summarise(
avg_salary = round(mean(salary),2),
avg_KPI = round(mean(KPI_score),2),
total_employee = n(),
top_performer = sum(kpi_tier == "Top Performer")
)
cat("\nTable 1: Summary\n")
print(kable(summary_table))
# =========================
# TABLE 2: TOP PERFORMERS
# =========================
top_data <- data_subset %>%
filter(kpi_tier == "Top Performer") %>%
arrange(desc(KPI_score)) %>%
head(5)
cat("\nTable 2: Top Performers\n")
print(kable(top_data))
# =========================
# PLOT 1: DEPARTMENT DISTRIBUTION
# =========================
p1 <- ggplot(data_subset, aes(x = department, fill = department)) +
geom_bar() +
labs(title = paste("Department Distribution - Company", c),
x = "Department", y = "Number of Employees") +
theme_minimal() +
theme(legend.position = "none")
print(p1)
# =========================
# PLOT 2: SALARY vs KPI (IMPROVED)
# =========================
p2 <- ggplot(data_subset, aes(x = salary, y = KPI_score, color = department)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(title = paste("Salary vs KPI - Company", c),
x = "Salary", y = "KPI Score") +
theme_minimal()
print(p2)
# =========================
# PLOT 3: SALARY DISTRIBUTION (IMPROVED)
# =========================
p3 <- ggplot(data_subset, aes(x = salary, fill = department)) +
geom_histogram(bins = 15, alpha = 0.6, position = "identity") +
labs(title = paste("Salary Distribution - Company", c),
x = "Salary", y = "Frequency") +
theme_minimal()
print(p3)
# =========================
# EXPORT CSV
# =========================
write.csv(data_subset,
paste0("company_", c, ".csv"),
row.names = FALSE)
cat("\n\n")
}
}
# =========================
# RUN
# =========================
generate_report(df_company)
##
## ====================================
## Company ID: 3
## ====================================
##
## Table 1: Summary
##
##
## | avg_salary| avg_KPI| total_employee| top_performer|
## |----------:|-------:|--------------:|-------------:|
## | 6640.58| 75.04| 54| 10|
##
## Table 2: Top Performers
##
##
## | company_id| salary| KPI_score| performance_score|department |kpi_tier |
## |----------:|--------:|---------:|-----------------:|:----------|:-------------|
## | 3| 4800.517| 99.30271| 66.37987|HR |Top Performer |
## | 3| 9313.121| 98.89267| 91.72005|Marketing |Top Performer |
## | 3| 5727.110| 98.73629| 81.48727|Marketing |Top Performer |
## | 3| 9409.785| 98.56712| 65.14438|HR |Top Performer |
## | 3| 9736.513| 98.37347| 68.32207|IT |Top Performer |
##
##
##
## ====================================
## Company ID: 2
## ====================================
##
## Table 1: Summary
##
##
## | avg_salary| avg_KPI| total_employee| top_performer|
## |----------:|-------:|--------------:|-------------:|
## | 6260.05| 71.2| 54| 8|
##
## Table 2: Top Performers
##
##
## | company_id| salary| KPI_score| performance_score|department |kpi_tier |
## |----------:|--------:|---------:|-----------------:|:----------|:-------------|
## | 2| 4574.897| 99.83086| 68.39480|Finance |Top Performer |
## | 2| 8310.152| 97.65506| 93.93370|Finance |Top Performer |
## | 2| 4513.784| 94.83693| 93.00534|HR |Top Performer |
## | 2| 6366.376| 94.50390| 90.07148|HR |Top Performer |
## | 2| 8098.761| 93.32417| 65.60564|Finance |Top Performer |
##
##
##
## ====================================
## Company ID: 1
## ====================================
##
## Table 1: Summary
##
##
## | avg_salary| avg_KPI| total_employee| top_performer|
## |----------:|-------:|--------------:|-------------:|
## | 6523.54| 77.18| 42| 10|
##
## Table 2: Top Performers
##
##
## | company_id| salary| KPI_score| performance_score|department |kpi_tier |
## |----------:|--------:|---------:|-----------------:|:----------|:-------------|
## | 1| 8086.918| 99.56183| 69.25868|Marketing |Top Performer |
## | 1| 9161.726| 99.29771| 58.90069|HR |Top Performer |
## | 1| 5847.828| 99.16751| 62.65495|Finance |Top Performer |
## | 1| 6827.783| 98.85495| 90.04741|Marketing |Top Performer |
## | 1| 5766.541| 98.19217| 83.95067|Marketing |Top Performer |
9 Conclusion & Reference
This practicum demonstrates the application of advanced programming concepts in data science using R, particularly through the integration of functions, loops, and conditional logic. Each task simulates real-world analytical scenarios, enabling a deeper understanding of how structured programming supports data-driven decision-making.
The Dynamic Multi-Formula function highlights how flexible models can be built to evaluate different mathematical behaviors simultaneously. The Nested Simulation and Performance Categorization tasks illustrate how iterative processes and logical conditions can be used to simulate business operations and classify performance effectively.
Furthermore, the Monte Carlo Simulation showcases the power of probabilistic methods in estimating mathematical constants and analyzing uncertainty through random sampling. The Advanced Data Transformation and Feature Engineering task emphasizes the importance of preparing and transforming data to improve interpretability and analytical quality.
The Mini Project and Automated Report Generation tasks represent a comprehensive data science workflow, combining data generation, transformation, visualization, and reporting. These tasks demonstrate how automated systems can generate insights efficiently across multiple entities, such as companies or departments.
Overall, this practicum reinforces the importance of combining programming logic with analytical thinking. It shows that well-structured code can be used not only to process data but also to generate meaningful insights, build interactive visualizations, and automate reporting processes in real-world data science applications.
| No | Author | Year | Title | Publisher |
|---|---|---|---|---|
| 1 | Wickham, H. | 2016 | ggplot2: Elegant Graphics for Data Analysis | Springer |
| 2 | Wickham, H. et al. | 2023 | dplyr: A Grammar of Data Manipulation | R Package Documentation |
| 3 | Sievert, C. | 2020 | Interactive Web-Based Data Visualization with R, plotly, and shiny | CRC Press |
| 4 | R Core Team | 2023 | R: A Language and Environment for Statistical Computing | R Foundation |
| 5 | James, G. et al. | 2021 | An Introduction to Statistical Learning | Springer |
| 6 | Ross, S. | 2014 | Introduction to Probability Models | Academic Press |