Practicum Week-5 explores advanced programming paradigms in R — focusing on building reusable functions, applying nested loops, and constructing real-world data science pipelines. The eight tasks below progressively build toward a full automated KPI Dashboard and report system.
The following packages power all computation, visualization, and reporting in this practicum. Install them once with install.packages() if not already available.
# ============================================================
# LIBRARIES — all packages required for Practicum Week-5
# ============================================================
library(ggplot2) # Grammar of Graphics — all plots
library(dplyr) # Data wrangling and transformation
library(tidyr) # Tidy data reshaping (pivot_wider, etc.)
library(scales) # Axis label formatting (comma, percent)
library(knitr) # Table rendering in HTML output
library(kableExtra) # Enhanced kable table styling
library(gridExtra) # Arrange multiple ggplots in one figure
Implement compute_formula(x, formula) supporting four formula types: linear, quadratic, cubic, and exponential. An outer loop over the formula types calls the function, whose inner for loop walks the x values; input validation rejects invalid formula names. All four curves are plotted together on a log-y axis for comparison.
# ============================================================
# TASK 1: Dynamic Multi-Formula Function
# compute_formula(x, formula) supports 4 formula types
# Nested loop: outer = formula type, inner = x values
# Input validation: stops with informative error if invalid
# ============================================================
compute_formula <- function(x, formula) {
# --- Input Validation ---
valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
if (!formula %in% valid_formulas) {
stop(paste0(
"Invalid formula: '", formula, "'. ",
"Choose one of: ", paste(valid_formulas, collapse = ", ")
))
}
# Pre-allocate output vector
result <- numeric(length(x))
# FOR LOOP: compute formula value for each x
for (i in seq_along(x)) {
xi <- x[i]
result[i] <- switch(formula,
"linear" = 2 * xi + 3, # f(x) = 2x + 3
"quadratic" = xi^2 + 2 * xi + 1, # f(x) = x² + 2x + 1
"cubic" = xi^3 - 5 * xi^2 + xi, # f(x) = x³ - 5x² + x
"exponential" = exp(0.5 * xi) # f(x) = e^(0.5x)
)
}
return(result)
}
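Because R arithmetic is vectorized, the element-by-element loop above can be collapsed into a single expression per formula. The sketch below is one possible variant, not the practicum's required solution: the name compute_formula_vec is hypothetical, and match.arg() stands in for the manual validation. It also makes visible that the cubic is negative for x = 1 to 4, values that a log-y axis cannot display.

```r
# Hypothetical vectorized variant of compute_formula() (the name is an assumption)
compute_formula_vec <- function(x, formula = c("linear", "quadratic", "cubic", "exponential")) {
  formula <- match.arg(formula)  # errors on any name outside the four choices
  switch(formula,
    linear      = 2 * x + 3,
    quadratic   = x^2 + 2 * x + 1,
    cubic       = x^3 - 5 * x^2 + x,
    exponential = exp(0.5 * x)
  )
}

x_check    <- 1:5
cubic_vals <- compute_formula_vec(x_check, "cubic")
print(cubic_vals)                  # -3 -10 -15 -12 5
print(x_check[cubic_vals <= 0])    # x values that a log10 axis would drop
```

Note that match.arg() also accepts unambiguous partial names such as "lin", which the original %in% check does not.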
# --- Build tidy data frame for all 4 formulas over x = 1:20 ---
x_seq <- 1:20
formula_list <- c("linear", "quadratic", "cubic", "exponential")
# OUTER LOOP: formula type | INNER LOOP (implicit via compute_formula): x values
df_formulas <- do.call(rbind, lapply(formula_list, function(f) {
data.frame(x = x_seq, y = compute_formula(x_seq, f), formula = f)
}))
# Preview sample values
df_formulas |>
dplyr::filter(x %in% c(1, 5, 10, 20)) |>
tidyr::pivot_wider(names_from = formula, values_from = y)
# ============================================================
# TASK 1: Visualization — 4 formulas on a log-y scale
# Gold/teal/coral/purple palette; points + lines; log scale
# ============================================================
formula_palette <- c(
"linear" = "#F4A72A",
"quadratic" = "#00BFA5",
"cubic" = "#E85D5D",
"exponential" = "#7C6FCD"
)
ggplot(df_formulas, aes(x = x, y = y, color = formula)) +
geom_line(linewidth = 1.6, alpha = 0.90) +
geom_point(size = 3.0, alpha = 0.85) +
scale_color_manual(values = formula_palette) +
scale_y_log10(labels = comma) + # cubic is negative for x = 1..4, so log10 drops those points with a warning
labs(
title = "Task 1 — Dynamic Multi-Formula Comparison (x = 1 to 20)",
subtitle = "Linear · Quadratic · Cubic · Exponential plotted on a log-y scale",
x = "x value",
y = "f(x) — logarithmic scale",
color = "Formula Type",
caption = "Source: Practicum Week-5 — compute_formula()"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 15, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "top",
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "#FAFBFF", color = NA)
)
Build simulate_sales(n_salesperson, days) with a nested helper function apply_discount() for conditional discount logic. The outer loop iterates over salespersons; the inner loop covers each day. Cumulative net sales are tracked and visualized.
# ============================================================
# TASK 2: Nested Simulation — Sales & Discounts
# Outer loop: salesperson | Inner loop: day
# Nested helper function: apply_discount(amount)
# ============================================================
set.seed(2025) # reproducibility
simulate_sales <- function(n_salesperson, days) {
# --- NESTED HELPER FUNCTION: conditional discount tiers ---
apply_discount <- function(amount) {
if (amount > 850) return(0.15) # 15%: high-value sale
else if (amount > 550) return(0.10) # 10%: mid-value sale
else return(0.05) # 5%: low-value sale
}
records <- list()
idx <- 1
# OUTER LOOP: each salesperson
for (sp in seq_len(n_salesperson)) {
cumulative <- 0
# INNER LOOP: each day for this salesperson
for (d in seq_len(days)) {
amount <- round(runif(1, 180, 1000), 2)
disc_rate <- apply_discount(amount)
net_amount <- round(amount * (1 - disc_rate), 2)
cumulative <- cumulative + net_amount
records[[idx]] <- data.frame(
sales_id = sp,
day = d,
sales_amount = amount,
discount_rate = disc_rate,
net_amount = net_amount,
cumulative_net = round(cumulative, 2)
)
idx <- idx + 1
}
}
return(dplyr::bind_rows(records))
}
# Run: 5 salespersons × 10 days
sales_df <- simulate_sales(n_salesperson = 5, days = 10)
head(sales_df, 10)
# ============================================================
# TASK 2: Aggregate summary per salesperson
# ============================================================
sales_summary <- sales_df |>
dplyr::group_by(sales_id) |>
dplyr::summarise(
Gross_Sales = round(sum(sales_amount), 0),
Net_Revenue = round(sum(net_amount), 0),
Avg_Discount = paste0(round(mean(discount_rate) * 100, 1), "%"),
Peak_Cumul = round(max(cumulative_net), 0),
.groups = "drop"
)
kable(sales_summary,
caption = "Task 2 — Salesperson Summary (5 SP × 10 Days)",
col.names = c("SP ID","Gross Sales","Net Revenue","Avg Discount","Peak Cumul.")) |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)
| SP ID | Gross Sales | Net Revenue | Avg Discount | Peak Cumul. |
|---|---|---|---|---|
| 1 | 6512 | 5829 | 10% | 5829 |
| 2 | 5890 | 5307 | 8.5% | 5307 |
| 3 | 6117 | 5508 | 8.5% | 5508 |
| 4 | 5520 | 4956 | 8.5% | 4956 |
| 5 | 7214 | 6436 | 10% | 6436 |
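The tier boundaries in apply_discount() use strict inequalities, so a sale of exactly 550 or 850 falls into the lower tier. A minimal sketch to check this, using a hypothetical vectorized helper apply_discount_vec() that mirrors the same thresholds:

```r
# Hypothetical vectorized mirror of apply_discount() (same thresholds, name assumed)
apply_discount_vec <- function(amount) {
  ifelse(amount > 850, 0.15,
         ifelse(amount > 550, 0.10, 0.05))
}

amounts <- c(200, 550, 550.01, 850, 850.01, 1000)   # boundary test values
rates   <- apply_discount_vec(amounts)
check   <- data.frame(
  sales_amount  = amounts,
  discount_rate = rates,
  net_amount    = round(amounts * (1 - rates), 2)
)
print(check)
```

Since every per-sale rate lies between 5% and 15%, each salesperson's average discount must fall in that range too, as the 8.5% and 10% values in the table do.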
# ============================================================
# TASK 2: Cumulative net sales line chart per salesperson
# ============================================================
sales_df$sales_id <- factor(sales_df$sales_id, labels = paste0("SP-", 1:5))
ggplot(sales_df, aes(x = day, y = cumulative_net, color = sales_id)) +
geom_line(linewidth = 1.4) +
geom_point(size = 3.2, shape = 21, aes(fill = sales_id),
color = "white", stroke = 1.6) +
scale_color_manual(values = c("#F4A72A","#E85D5D","#00BFA5","#7C6FCD","#1E88E5")) +
scale_fill_manual(values = c("#F4A72A","#E85D5D","#00BFA5","#7C6FCD","#1E88E5")) +
scale_y_continuous(labels = comma) +
scale_x_continuous(breaks = 1:10) +
labs(
title = "Task 2 — Cumulative Net Sales per Salesperson (10 Days)",
subtitle = "Discount tiers: >850 → 15% | >550 → 10% | ≤550 → 5%",
x = "Day", y = "Cumulative Net Sales",
color = "Salesperson", fill = "Salesperson",
caption = "Source: Practicum Week-5 — simulate_sales()"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "right",
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "#FAFBFF", color = NA)
)
The apply_discount() function cleanly separates business logic from the simulation loop, a key software design principle.
Build categorize_performance(sales_amount) that classifies values into five tiers: Excellent, Very Good, Good, Average, Poor. A for loop processes the full vector element-by-element. Output includes a percentage table, bar chart, and pie chart.
# ============================================================
# TASK 3: Multi-Level Performance Categorization
# categorize_performance(x) → character vector (5 tiers)
# For loop + nested if-else chain processes each element
# ============================================================
categorize_performance <- function(sales_amount) {
category <- character(length(sales_amount)) # pre-allocate
for (i in seq_along(sales_amount)) {
val <- sales_amount[i]
if (val >= 900) category[i] <- "Excellent" # ≥ 900
else if (val >= 700) category[i] <- "Very Good" # 700–899
else if (val >= 500) category[i] <- "Good" # 500–699
else if (val >= 300) category[i] <- "Average" # 300–499
else category[i] <- "Poor" # < 300
}
return(category)
}
# Apply to 120 simulated uniform sales values
set.seed(2025)
raw_sales <- round(runif(120, 100, 1000), 2)
categories <- categorize_performance(raw_sales)
perf_df <- data.frame(
sales_amount = raw_sales,
category = factor(categories,
levels = c("Poor","Average","Good","Very Good","Excellent"))
)
perf_pct <- perf_df |>
dplyr::count(category) |>
dplyr::mutate(
pct = round(n / sum(n) * 100, 1),
label = paste0(n, " (", pct, "%)")
)
kable(perf_pct, caption = "Task 3 — Category Distribution (n=120 sales records)",
col.names = c("Category","Count","Percentage (%)","Label")) |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)
| Category | Count | Percentage (%) | Label |
|---|---|---|---|
| Poor | 28 | 23.3 | 28 (23.3%) |
| Average | 21 | 17.5 | 21 (17.5%) |
| Good | 26 | 21.7 | 26 (21.7%) |
| Very Good | 30 | 25.0 | 30 (25%) |
| Excellent | 15 | 12.5 | 15 (12.5%) |
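The same five tiers can be produced without an explicit loop. The following is a sketch using base R's cut(); the function name categorize_performance_cut is hypothetical, and right = FALSE makes each interval closed on the left so that it matches the >= comparisons in the loop version.

```r
# Hypothetical loop-free variant of categorize_performance() (name assumed)
categorize_performance_cut <- function(sales_amount) {
  cut(sales_amount,
      breaks = c(-Inf, 300, 500, 700, 900, Inf),
      labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
      right  = FALSE)   # intervals are [lower, upper)
}

edge_cases <- c(299.99, 300, 499.99, 500, 700, 899.99, 900)
print(data.frame(
  sales_amount = edge_cases,
  category     = categorize_performance_cut(edge_cases)
))
```

The result is a factor whose levels are already ordered from Poor to Excellent, so the separate factor() call with explicit levels would not be needed in this variant.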
# ============================================================
# TASK 3: Bar chart — category frequency with navy/gold palette
# ============================================================
tier_colors <- c(
"Poor" = "#E85D5D",
"Average" = "#F4A72A",
"Good" = "#00BFA5",
"Very Good" = "#1E88E5",
"Excellent" = "#7C6FCD"
)
ggplot(perf_pct, aes(x = category, y = n, fill = category)) +
geom_col(width = 0.60, show.legend = FALSE, alpha = 0.88) +
geom_text(aes(label = label), vjust = -0.45,
fontface = "bold", size = 4.0, color = "#1A1A2E") +
scale_fill_manual(values = tier_colors) +
labs(
title = "Task 3 — Performance Category Distribution (Bar Chart)",
subtitle = "120 simulated sales values drawn from uniform distribution (100–1000)",
x = "Performance Category",
y = "Number of Records",
caption = "Source: Practicum Week-5 — categorize_performance()"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
# ============================================================
# TASK 3: Pie chart — proportional breakdown by category
# ============================================================
ggplot(perf_pct, aes(x = "", y = n, fill = category)) +
geom_col(width = 1, color = "white", linewidth = 0.9) +
coord_polar(theta = "y") +
geom_text(aes(label = paste0(category, "\n", pct, "%")),
position = position_stack(vjust = 0.5),
fontface = "bold", size = 3.8, color = "#1A1A2E") +
scale_fill_manual(values = tier_colors) +
labs(
title = "Task 3 — Performance Category Distribution (Pie Chart)",
fill = "Category",
caption = "Source: Practicum Week-5 — categorize_performance()"
) +
theme_void() +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5, color = "#3D2B6B"),
legend.position = "right"
)
The near-even split across tiers comes from the uniform input, not from the categorize_performance() logic. In practice, a real sales dataset would typically exhibit a skewed distribution, making the categorization function even more valuable for discovering where performance is concentrated.
Build generate_company_data(n_company, n_employees) using nested loops (outer: company, inner: employee). Conditional logic flags top performers where KPI > 90. Output includes a per-company summary table and a salary–KPI scatter plot with regression lines.
# ============================================================
# TASK 4: Multi-Company Dataset — Nested Loops
# Outer loop: company_id | Inner loop: employee within company
# Conditional logic: top_performer flag (KPI > 90)
# ============================================================
set.seed(42)
generate_company_data <- function(n_company, n_employees) {
depts <- c("Engineering","Marketing","Finance","Operations","HR")
records <- list()
idx <- 1
# OUTER LOOP: each company
for (co in seq_len(n_company)) {
company_id <- paste0("CO-", LETTERS[co]) # CO-A, CO-B, ...
# INNER LOOP: each employee in this company
for (emp in seq_len(n_employees)) {
salary <- round(runif(1, 4500, 16000), 0)
performance_score <- round(runif(1, 55, 100), 1)
KPI_score <- round(runif(1, 50, 100), 1)
department <- sample(depts, 1)
# Conditional flag: top performer if KPI > 90
top_performer <- ifelse(KPI_score > 90, "Yes", "No")
records[[idx]] <- data.frame(
company_id = company_id,
employee_id = paste0(company_id, "-EMP", sprintf("%03d", emp)),
salary = salary,
department = department,
performance_score = performance_score,
KPI_score = KPI_score,
top_performer = top_performer
)
idx <- idx + 1
}
}
return(dplyr::bind_rows(records))
}
# Generate: 3 companies × 40 employees each
company_df <- generate_company_data(n_company = 3, n_employees = 40)
cat(sprintf("Dataset: %d employees across %d companies\n",
nrow(company_df), length(unique(company_df$company_id))))
Dataset: 120 employees across 3 companies
# ============================================================
# TASK 4: Company-level aggregate summary
# ============================================================
company_summary <- company_df |>
dplyr::group_by(company_id) |>
dplyr::summarise(
Employees = n(),
Avg_Salary = formatC(round(mean(salary), 0), big.mark = ",", format = "d"),
Avg_Perf = round(mean(performance_score), 2),
Avg_KPI = round(mean(KPI_score), 2),
Max_KPI = round(max(KPI_score), 1),
Top_Performers = sum(top_performer == "Yes"),
.groups = "drop"
)
kable(company_summary,
caption = "Task 4 — Multi-Company Summary (3 Companies × 40 Employees)",
col.names = c("Company","Employees","Avg Salary","Avg Perf","Avg KPI","Max KPI","Top Performers")) |>
kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)
| Company | Employees | Avg Salary | Avg Perf | Avg KPI | Max KPI | Top Performers |
|---|---|---|---|---|---|---|
| CO-A | 40 | 10,715 | 80.54 | 77.20 | 98.9 | 6 |
| CO-B | 40 | 9,964 | 77.03 | 70.45 | 99.0 | 4 |
| CO-C | 40 | 10,067 | 76.68 | 72.12 | 99.0 | 7 |
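As a rough sanity check on the Top Performers column: KPI_score is drawn from a uniform distribution on [50, 100], so the chance of exceeding 90 is about 10/50 = 0.2 (ignoring the small effect of rounding to one decimal). The sketch below, with illustrative variable names, works out the expected count and its spread for 40 employees and compares them with the counts in the table.

```r
n_employees  <- 40
p_top        <- (100 - 90) / (100 - 50)                    # P(KPI > 90) under Uniform(50, 100)
expected_top <- n_employees * p_top                        # binomial mean
sd_top       <- sqrt(n_employees * p_top * (1 - p_top))    # binomial standard deviation

observed <- c("CO-A" = 6, "CO-B" = 4, "CO-C" = 7)          # counts from the summary table
z <- (observed - expected_top) / sd_top                    # distance from expectation in SDs

cat(sprintf("expected = %.0f, sd = %.2f\n", expected_top, sd_top))
print(round(z, 2))
```

The expected count is 8 with a standard deviation of about 2.53, so all three observed counts (6, 4, 7) lie within two standard deviations of the expectation.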
# ============================================================
# TASK 4: Scatter + regression line, colored by company
# Top performers marked with a distinct shape (triangle)
# ============================================================
ggplot(company_df, aes(x = KPI_score, y = salary,
color = company_id,
shape = top_performer)) +
geom_point(size = 3.0, alpha = 0.72) +
geom_smooth(aes(group = company_id), method = "lm",
se = FALSE, linewidth = 1.1, alpha = 0.85) +
scale_color_manual(values = c("CO-A" = "#F4A72A",
"CO-B" = "#E85D5D",
"CO-C" = "#00BFA5")) +
scale_shape_manual(values = c("No" = 16, "Yes" = 17)) +
scale_y_continuous(labels = comma) +
labs(
title = "Task 4 — Salary vs KPI Score by Company",
subtitle = "Triangles = Top Performers (KPI > 90) | Lines = OLS regression per company",
x = "KPI Score", y = "Monthly Salary",
color = "Company", shape = "Top Performer",
caption = "Source: Practicum Week-5 — generate_company_data()"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "right",
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "#FAFBFF", color = NA)
)
Build monte_carlo_pi(n_points) that estimates π by sampling random points uniformly in the square [-1, 1] × [-1, 1] and testing whether they fall inside the unit circle. A secondary analysis estimates the probability that a point lands inside the sub-square [-0.5, 0.5] × [-0.5, 0.5]. All logic uses a for loop.
# ============================================================
# TASK 5: Monte Carlo Pi Estimation
# monte_carlo_pi(n_points) → list(pi_estimate, prob, coords)
# FOR LOOP: classify each point inside/outside unit circle
# ============================================================
set.seed(123) # fixed seed for reproducibility
monte_carlo_pi <- function(n_points) {
x_pts <- runif(n_points, -1, 1)
y_pts <- runif(n_points, -1, 1)
in_circle <- numeric(n_points)
in_subsquare <- numeric(n_points)
# FOR LOOP: classify each point
for (i in seq_len(n_points)) {
# Unit circle check: x² + y² ≤ 1
if (x_pts[i]^2 + y_pts[i]^2 <= 1) {
in_circle[i] <- 1
}
# Sub-square check: both |x| ≤ 0.5 and |y| ≤ 0.5
# Theoretical probability: (1×1) / (2×2) = 0.25
if (abs(x_pts[i]) <= 0.5 && abs(y_pts[i]) <= 0.5) {
in_subsquare[i] <- 1
}
}
pi_est <- 4 * sum(in_circle) / n_points
prob_sub <- sum(in_subsquare) / n_points
return(list(
pi_estimate = pi_est,
prob_subsquare = prob_sub,
x = x_pts,
y = y_pts,
in_circle = in_circle
))
}
# Run with 6,000 points
mc <- monte_carlo_pi(6000)
cat("=== Monte Carlo Results (n = 6,000) ===\n")
=== Monte Carlo Results (n = 6,000) ===
Estimated π : 3.146000
True π (built-in) : 3.141593
Absolute Error : 0.004407
Error (%) : 0.1403%
Sub-square prob : 0.2617
Expected (1/4) : 0.2500
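The size of this error is in line with what sampling theory predicts. Each point is a Bernoulli trial with success probability p = π/4, so the estimator 4 × (hits / n) has a standard error of 4 × sqrt(p(1 - p) / n). The vectorized sketch below uses a hypothetical function name, monte_carlo_pi_vec, and makes no claim about the specific estimate it prints.

```r
# Hypothetical vectorized variant of monte_carlo_pi() (name assumed)
monte_carlo_pi_vec <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  4 * mean(x^2 + y^2 <= 1)     # fraction of points inside the unit circle, times 4
}

n     <- 6000
p     <- pi / 4                          # true probability of landing in the circle
se_pi <- 4 * sqrt(p * (1 - p) / n)       # theoretical standard error of the estimate

set.seed(123)
est <- monte_carlo_pi_vec(n)
cat(sprintf("estimate = %.4f, standard error = %.4f\n", est, se_pi))
```

With n = 6,000 the standard error is about 0.021, so the reported absolute error of 0.0044 is well within one standard error. Halving the error would take roughly four times as many points.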
# ============================================================
# TASK 5: Plot — Monte Carlo sampling scatter
# Gold = inside circle | Coral = outside circle
# Navy circle boundary | Purple dashed sub-square
# ============================================================
mc_df <- data.frame(
x = mc$x,
y = mc$y,
in_circle = factor(mc$in_circle,
levels = c(0, 1),
labels = c("Outside Circle","Inside Circle"))
)
ggplot(mc_df, aes(x = x, y = y, color = in_circle)) +
geom_point(size = 0.60, alpha = 0.50) +
annotate("path",
x = cos(seq(0, 2 * pi, length.out = 300)),
y = sin(seq(0, 2 * pi, length.out = 300)),
color = "#3D2B6B", linewidth = 1.4) +
annotate("rect", xmin = -0.5, xmax = 0.5, ymin = -0.5, ymax = 0.5,
color = "#7C6FCD", fill = NA, linewidth = 1.2, linetype = "dashed") +
scale_color_manual(values = c("Outside Circle" = "#E85D5D",
"Inside Circle" = "#F4A72A")) +
coord_fixed() +
labs(
title = "Task 5 — Monte Carlo π Estimation (n = 6,000)",
subtitle = paste0("Estimated π = ", round(mc$pi_estimate, 5),
" | Sub-square hit rate = ",
round(mc$prob_subsquare, 4)),
x = "x coordinate", y = "y coordinate",
color = "Point Region",
caption = "Navy = unit circle boundary | Purple dashed = sub-square [-0.5, 0.5]²"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
Build normalize_columns(df) (Min-Max) and z_score(df) (Z-Score Standardization) using loop-based column iteration. Then engineer two new categorical features. Compare salary distributions before and after transformation with faceted histograms and a violin+boxplot.
# ============================================================
# TASK 6: Loop-based normalization & standardization
# normalize_columns(df) → Min-Max scaling to [0, 1]
# z_score(df) → Mean=0, SD=1 standardization
# Both iterate over column names with a for loop
# ============================================================
# --- Function 1: Min-Max Normalization ---
normalize_columns <- function(df) {
num_cols <- names(df)[sapply(df, is.numeric)]
result <- df
for (col in num_cols) { # FOR LOOP over columns
lo <- min(df[[col]], na.rm = TRUE)
hi <- max(df[[col]], na.rm = TRUE)
result[[col]] <- (df[[col]] - lo) / (hi - lo) # Min-Max formula
}
return(result)
}
# --- Function 2: Z-Score Standardization ---
z_score <- function(df) {
num_cols <- names(df)[sapply(df, is.numeric)]
result <- df
for (col in num_cols) { # FOR LOOP over columns
mu <- mean(df[[col]], na.rm = TRUE)
sigma <- sd(df[[col]], na.rm = TRUE)
result[[col]] <- (df[[col]] - mu) / sigma # Z-Score formula
}
return(result)
}
# Apply both functions to numeric columns from Task 4 dataset
company_num <- company_df |> dplyr::select(salary, performance_score, KPI_score)
company_norm <- normalize_columns(company_num)
company_zscore <- z_score(company_num)
cat("--- ORIGINAL ---\n")
--- ORIGINAL ---
salary performance_score KPI_score
Min. : 4527 Min. :55.00 Min. :50.20
1st Qu.: 7788 1st Qu.:63.12 1st Qu.:60.40
Median : 9993 Median :79.20 Median :71.75
Mean :10249 Mean :78.09 Mean :73.26
3rd Qu.:13049 3rd Qu.:90.40 3rd Qu.:86.53
Max. :15872 Max. :99.60 Max. :99.00
--- MIN-MAX NORMALIZED [0,1] ---
salary performance_score KPI_score
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.2874 1st Qu.:0.1822 1st Qu.:0.2090
Median :0.4818 Median :0.5426 Median :0.4416
Mean :0.5043 Mean :0.5177 Mean :0.4725
3rd Qu.:0.7512 3rd Qu.:0.7937 3rd Qu.:0.7444
Max. :1.0000 Max. :1.0000 Max. :1.0000
--- Z-SCORE STANDARDIZED (mean≈0, sd=1) ---
salary performance_score KPI_score
Min. :-1.74920 Min. :-1.64529 Min. :-1.5877
1st Qu.:-0.75232 1st Qu.:-1.06628 1st Qu.:-0.8853
Median :-0.07812 Median : 0.07928 Median :-0.1038
Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.85624 3rd Qu.: 0.87743 3rd Qu.: 0.9136
Max. : 1.71921 Max. : 1.53305 Max. : 1.7726
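Both transformations are linear rescalings, which is why the quartiles above keep their relative positions. The small sketch below, on a toy vector chosen for illustration only, checks the two formulas against base R's scale() and shows one edge case the loop versions do not guard against: a constant column makes hi - lo (or the standard deviation) zero, so the result is NaN.

```r
x <- c(4500, 8000, 10000, 16000)                 # toy salaries, illustrative only

minmax <- (x - min(x)) / (max(x) - min(x))       # Min-Max formula
z      <- (x - mean(x)) / sd(x)                  # Z-Score formula

print(round(minmax, 3))
print(isTRUE(all.equal(z, as.numeric(scale(x)))))  # agrees with base R's scale()

# Edge case: a constant column divides 0 by 0
constant        <- rep(5, 4)
constant_minmax <- (constant - min(constant)) / (max(constant) - min(constant))
print(constant_minmax)                           # NaN for every element
```

A guard such as returning 0 when hi equals lo is one common way to handle this, though the right choice depends on how the scaled column is used downstream.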
# ============================================================
# TASK 6: Feature Engineering — 2 new categorical columns
# performance_category: Low / Medium / High
# salary_bracket: Entry / Mid / Senior
# ============================================================
company_df <- company_df |>
dplyr::mutate(
performance_category = dplyr::case_when(
performance_score >= 87 ~ "High",
performance_score >= 72 ~ "Medium",
TRUE ~ "Low"
),
salary_bracket = dplyr::case_when(
salary >= 13000 ~ "Senior",
salary >= 8000 ~ "Mid",
TRUE ~ "Entry"
)
)
cat("=== New Feature: performance_category ===\n")
=== New Feature: performance_category ===
High Low Medium
40 43 37
=== New Feature: salary_bracket ===
Entry Mid Senior
33 56 31
# ============================================================
# TASK 6: Faceted histogram — salary in 3 forms
# ============================================================
hist_df <- dplyr::bind_rows(
data.frame(value = company_num$salary, type = "1. Original"),
data.frame(value = company_norm$salary, type = "2. Min-Max [0,1]"),
data.frame(value = company_zscore$salary, type = "3. Z-Score")
)
ggplot(hist_df, aes(x = value, fill = type)) +
geom_histogram(bins = 18, color = "white", alpha = 0.88) +
facet_wrap(~type, scales = "free_x", nrow = 1) +
scale_fill_manual(values = c(
"1. Original" = "#F4A72A",
"2. Min-Max [0,1]" = "#00BFA5",
"3. Z-Score" = "#7C6FCD"
)) +
labs(
title = "Task 6 — Salary Distribution: Original vs Transformed",
subtitle = "Shape preserved across all three — only the axis scale changes",
x = "Value", y = "Count",
caption = "Source: Practicum Week-5 — normalize_columns() & z_score()"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 13, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "none",
strip.text = element_text(face = "bold", size = 10),
panel.grid.minor = element_blank()
)
# ============================================================
# TASK 6: Violin + Boxplot — salary by performance category
# ============================================================
company_df$performance_category <- factor(
company_df$performance_category, levels = c("Low","Medium","High")
)
ggplot(company_df, aes(x = performance_category, y = salary,
fill = performance_category)) +
geom_violin(alpha = 0.50, trim = FALSE) +
geom_boxplot(width = 0.18, alpha = 0.88, outlier.size = 3,
outlier.color = "#E85D5D", color = "#1A1A2E") +
geom_jitter(width = 0.06, alpha = 0.40, size = 2, color = "#455A64") +
scale_fill_manual(values = c("Low" = "#E85D5D",
"Medium" = "#F4A72A",
"High" = "#00BFA5")) +
scale_y_continuous(labels = comma) +
labs(
title = "Task 6 — Salary Distribution by Performance Category",
subtitle = "Violin = distribution shape | Box = median & IQR | Dots = individual employees",
x = "Performance Category", y = "Monthly Salary", fill = "Category",
caption = "Source: Practicum Week-5 — Feature Engineering"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "none",
panel.grid.minor = element_blank()
)
Generate a full dataset for 5 companies × 60 employees each (300 rows). Use a for loop to classify employees into 4 KPI tiers. Produce a summary table, a grouped bar chart, a scatter plot faceted by company, and a salary density (area) chart.
# ============================================================
# TASK 7: Full KPI Dashboard — 5 companies × 60 employees
# Reuses generate_company_data() from Task 4
# FOR LOOP: classifies each employee into a KPI tier
# ============================================================
set.seed(77)
dashboard_df <- generate_company_data(n_company = 5, n_employees = 60)
# --- FOR LOOP: KPI tier classification ---
kpi_tier <- character(nrow(dashboard_df))
for (i in seq_len(nrow(dashboard_df))) {
kpi <- dashboard_df$KPI_score[i]
if (kpi >= 90) kpi_tier[i] <- "Elite" # Top achievers
else if (kpi >= 75) kpi_tier[i] <- "Solid" # Consistent performers
else if (kpi >= 60) kpi_tier[i] <- "Growing" # Developing
else kpi_tier[i] <- "At Risk" # Needs support
}
dashboard_df$kpi_tier <- factor(kpi_tier,
levels = c("At Risk","Growing","Solid","Elite"))
cat(sprintf("Dashboard dataset: %d employees, %d companies\n",
nrow(dashboard_df), length(unique(dashboard_df$company_id))))
Dashboard dataset: 300 employees, 5 companies
# ============================================================
# TASK 7: Aggregate KPI summary per company
# ============================================================
dashboard_summary <- dashboard_df |>
dplyr::group_by(company_id) |>
dplyr::summarise(
Avg_Salary = formatC(round(mean(salary), 0), big.mark = ",", format = "d"),
Avg_KPI = round(mean(KPI_score), 2),
Top_Performers= sum(top_performer == "Yes"),
Elite = sum(kpi_tier == "Elite"),
Growing = sum(kpi_tier == "Growing"),
At_Risk = sum(kpi_tier == "At Risk"),
.groups = "drop"
)
kable(dashboard_summary,
caption = "Task 7 — KPI Dashboard: Company Summary (5 Companies × 60 Employees)") |>
kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)
| company_id | Avg_Salary | Avg_KPI | Top_Performers | Elite | Growing | At_Risk |
|---|---|---|---|---|---|---|
| CO-A | 10,281 | 73.80 | 13 | 13 | 19 | 15 |
| CO-B | 10,114 | 71.11 | 9 | 9 | 18 | 21 |
| CO-C | 10,538 | 75.46 | 14 | 14 | 18 | 13 |
| CO-D | 10,373 | 78.98 | 10 | 10 | 15 | 6 |
| CO-E | 10,629 | 77.84 | 16 | 16 | 12 | 11 |
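The summary omits the Solid tier, but it can be recovered from the table because the four tiers partition each company's 60 employees. A short sketch using the counts printed above:

```r
# Tier counts copied from the summary table
tiers <- data.frame(
  company_id = c("CO-A", "CO-B", "CO-C", "CO-D", "CO-E"),
  Elite      = c(13, 9, 14, 10, 16),
  Growing    = c(19, 18, 18, 15, 12),
  At_Risk    = c(15, 21, 13, 6, 11)
)

n_employees <- 60
# Whatever is not Elite, Growing, or At Risk must be Solid
tiers$Solid <- n_employees - tiers$Elite - tiers$Growing - tiers$At_Risk
print(tiers)
```

For CO-A and CO-B this gives 13 and 12, matching the Solid counts in the automated reports further down.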
# ============================================================
# TASK 7: Grouped bar chart — KPI tier count per company
# ============================================================
kpi_palette <- c(
"At Risk" = "#E85D5D",
"Growing" = "#F4A72A",
"Solid" = "#1E88E5",
"Elite" = "#7C6FCD"
)
tier_counts <- dashboard_df |>
dplyr::count(company_id, kpi_tier)
ggplot(tier_counts, aes(x = company_id, y = n, fill = kpi_tier)) +
geom_col(position = "dodge", width = 0.75, alpha = 0.90) +
geom_text(aes(label = n), position = position_dodge(width = 0.75),
vjust = -0.4, fontface = "bold", size = 3.5) +
scale_fill_manual(values = kpi_palette) +
labs(
title = "Task 7 — KPI Tier Distribution per Company (Grouped Bar)",
subtitle = "5 companies × 60 employees; 4 KPI performance tiers",
x = "Company", y = "Employee Count", fill = "KPI Tier",
caption = "Source: Practicum Week-5 — KPI Dashboard"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "top",
panel.grid.minor = element_blank()
)
# ============================================================
# TASK 7: Faceted scatter + regression lines per company
# Color = KPI tier; shape = top_performer status
# ============================================================
ggplot(dashboard_df, aes(x = KPI_score, y = salary,
color = kpi_tier, shape = top_performer)) +
geom_point(size = 2.5, alpha = 0.72) +
geom_smooth(aes(group = 1), method = "lm", se = FALSE, linewidth = 0.8,
color = "#3D2B6B", alpha = 0.70) + # group = 1: one fit per facet, not one per inherited shape group
scale_color_manual(values = kpi_palette) +
scale_shape_manual(values = c("No" = 16, "Yes" = 17)) +
scale_y_continuous(labels = comma) +
facet_wrap(~company_id, nrow = 2) +
labs(
title = "Task 7 — KPI Score vs Salary by Company (Faceted)",
subtitle = "Triangles = Top Performers (KPI > 90) | Regression lines per facet",
x = "KPI Score", y = "Monthly Salary",
color = "KPI Tier", shape = "Top Performer",
caption = "Source: Practicum Week-5 — KPI Dashboard (n=300)"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 13, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 9),
legend.position = "right",
strip.text = element_text(face = "bold", color = "#0F3460"),
panel.grid.minor = element_blank()
)
# ============================================================
# TASK 7: Overlapping area density chart — salary by KPI tier
# ============================================================
ggplot(dashboard_df, aes(x = salary, fill = kpi_tier, color = kpi_tier)) +
geom_density(alpha = 0.30, linewidth = 0.9) +
scale_fill_manual(values = kpi_palette) +
scale_color_manual(values = kpi_palette) +
scale_x_continuous(labels = comma) +
labs(
title = "Task 7 — Salary Density by KPI Tier (Area Chart)",
subtitle = "Overlapping densities reveal salary spread within each KPI tier",
x = "Monthly Salary", y = "Density",
fill = "KPI Tier", color = "KPI Tier",
caption = "Source: Practicum Week-5 — KPI Dashboard (n=300)"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "#3D2B6B"),
plot.subtitle = element_text(color = "#455A64", size = 10),
legend.position = "right",
panel.grid.minor = element_blank()
)
Implement an automated report generation pipeline using functions + loops. A single generate_report() function encapsulates all per-company logic; a for loop calls it for every company automatically, producing structured text summaries, a CSV export, and a grid of auto-generated mini-plots.
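The pattern described here, one function holding the per-group logic and one loop driving it, can be sketched on a toy data frame. Everything in this example (mini_report, toy_df, and its values) is hypothetical and far simpler than generate_report().

```r
# Toy data, illustrative only
toy_df <- data.frame(
  company_id = c("CO-A", "CO-A", "CO-B", "CO-B", "CO-B"),
  KPI_score  = c(91.5, 62.0, 75.5, 88.0, 55.0),
  stringsAsFactors = FALSE
)

# Per-company logic lives in one function
mini_report <- function(cid, df) {
  cdata <- df[df$company_id == cid, ]
  line  <- sprintf("%s: %d employees, avg KPI = %.2f",
                   cid, nrow(cdata), mean(cdata$KPI_score))
  cat(line, "\n")
  invisible(line)   # return the text so callers can collect it
}

# The loop only decides which companies to report on
report_lines <- character(0)
for (cid in sort(unique(toy_df$company_id))) {
  report_lines <- c(report_lines, mini_report(cid, toy_df))
}
```

Adding a company to the data requires no change to the loop, which is the main benefit of keeping the per-company logic inside the function.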
# ============================================================
# TASK 8 (BONUS): Automated Report Generation
# generate_report(cid, df) → prints full company summary
# Called inside a for loop — no manual repetition
# ============================================================
generate_report <- function(cid, df) {
cdata <- df |> dplyr::filter(company_id == cid)
n_emp <- nrow(cdata)
avg_salary <- round(mean(cdata$salary), 0)
avg_kpi <- round(mean(cdata$KPI_score), 2)
avg_perf <- round(mean(cdata$performance_score), 2)
n_top <- sum(cdata$top_performer == "Yes")
pct_top <- round(n_top / n_emp * 100, 1)
tier_tbl <- table(cdata$kpi_tier)
dept_summary <- cdata |>
dplyr::group_by(department) |>
dplyr::summarise(Count = n(), Avg_KPI = round(mean(KPI_score), 1), .groups = "drop") |>
dplyr::arrange(dplyr::desc(Avg_KPI))
top3 <- cdata |>
dplyr::arrange(dplyr::desc(KPI_score)) |>
dplyr::select(employee_id, department, KPI_score, salary) |>
head(3)
cat(rep("=", 62), "\n", sep = "")
cat(sprintf(" AUTOMATED REPORT COMPANY %s\n", cid))
cat(rep("=", 62), "\n\n", sep = "")
cat(sprintf(" Total Employees : %d\n", n_emp))
cat(sprintf(" Avg Monthly Salary : IDR %s\n",
formatC(avg_salary, big.mark = ",", format = "d")))
cat(sprintf(" Avg KPI Score : %.2f\n", avg_kpi))
cat(sprintf(" Avg Perf Score : %.2f\n", avg_perf))
cat(sprintf(" Top Performers : %d (%.1f%% of workforce)\n\n", n_top, pct_top))
cat(" KPI TIER BREAKDOWN \n")
for (tier in names(tier_tbl)) {
bar <- paste0(rep("#", round(tier_tbl[[tier]] / n_emp * 28)), collapse = "")
cat(sprintf(" %-10s : %2d employees %s\n", tier, tier_tbl[[tier]], bar))
}
cat("\n DEPARTMENT SUMMARY \n")
for (r in seq_len(nrow(dept_summary))) {
cat(sprintf(" %-14s : %2d emp | Avg KPI = %.1f\n",
dept_summary$department[r], dept_summary$Count[r],
dept_summary$Avg_KPI[r]))
}
cat("\n TOP 3 PERFORMERS \n")
for (r in seq_len(nrow(top3))) {
cat(sprintf(" #%d %-16s | Dept: %-14s | KPI: %.1f | Salary: IDR %s\n",
r, top3$employee_id[r], top3$department[r],
top3$KPI_score[r],
formatC(top3$salary[r], big.mark = ",", format = "d")))
}
cat("\n")
}
# ============================================================
# TASK 8 (BONUS): FOR LOOP — generate report for every company
# ============================================================
all_companies <- sort(unique(dashboard_df$company_id))
for (cid in all_companies) {
generate_report(cid, dashboard_df)
}
AUTOMATED REPORT COMPANY CO-A
Total Employees : 60
Avg Monthly Salary : IDR 10,281
Avg KPI Score : 73.80
Avg Perf Score : 78.05
Top Performers : 13 (21.7% of workforce)
KPI TIER BREAKDOWN
At Risk : 15 employees
Growing : 19 employees
Solid : 13 employees
Elite : 13 employees
DEPARTMENT SUMMARY
HR : 13 emp | Avg KPI = 75.0
Marketing : 10 emp | Avg KPI = 74.4
Engineering : 13 emp | Avg KPI = 74.1
Finance : 17 emp | Avg KPI = 74.0
Operations : 7 emp | Avg KPI = 69.8
TOP 3 PERFORMERS
#1 CO-A-EMP015 | Dept: Engineering | KPI: 96.7 | Salary: IDR 15,272
#2 CO-A-EMP008 | Dept: HR | KPI: 96.0 | Salary: IDR 9,534
#3 CO-A-EMP013 | Dept: Finance | KPI: 95.1 | Salary: IDR 15,614
AUTOMATED REPORT COMPANY CO-B
Total Employees : 60
Avg Monthly Salary : IDR 10,114
Avg KPI Score : 71.11
Avg Perf Score : 77.75
Top Performers : 9 (15.0% of workforce)
KPI TIER BREAKDOWN
At Risk : 21 employees
Growing : 18 employees
Solid : 12 employees
Elite : 9 employees
DEPARTMENT SUMMARY
Marketing : 14 emp | Avg KPI = 73.1
Finance : 5 emp | Avg KPI = 72.9
HR : 12 emp | Avg KPI = 71.5
Engineering : 11 emp | Avg KPI = 71.2
Operations : 18 emp | Avg KPI = 68.8
TOP 3 PERFORMERS
#1 CO-B-EMP028 | Dept: Operations | KPI: 99.9 | Salary: IDR 13,965
#2 CO-B-EMP055 | Dept: Operations | KPI: 99.6 | Salary: IDR 10,350
#3 CO-B-EMP059 | Dept: Marketing | KPI: 96.3 | Salary: IDR 14,422
AUTOMATED REPORT COMPANY CO-C
Total Employees : 60
Avg Monthly Salary : IDR 10,538
Avg KPI Score : 75.46
Avg Perf Score : 79.04
Top Performers : 14 (23.3% of workforce)
KPI TIER BREAKDOWN
At Risk : 13 employees
Growing : 18 employees
Solid : 15 employees
Elite : 14 employees
DEPARTMENT SUMMARY
HR : 9 emp | Avg KPI = 83.7
Marketing : 9 emp | Avg KPI = 79.5
Engineering : 13 emp | Avg KPI = 75.9
Finance : 13 emp | Avg KPI = 71.7
Operations : 16 emp | Avg KPI = 71.3
TOP 3 PERFORMERS
#1 CO-C-EMP052 | Dept: HR | KPI: 100.0 | Salary: IDR 6,816
#2 CO-C-EMP045 | Dept: Marketing | KPI: 99.7 | Salary: IDR 10,957
#3 CO-C-EMP016 | Dept: Engineering | KPI: 99.4 | Salary: IDR 8,941
AUTOMATED REPORT COMPANY CO-D
Total Employees : 60
Avg Monthly Salary : IDR 10,373
Avg KPI Score : 78.98
Avg Perf Score : 77.19
Top Performers : 10 (16.7% of workforce)
KPI TIER BREAKDOWN
At Risk : 6 employees
Growing : 15 employees
Solid : 29 employees
Elite : 10 employees
DEPARTMENT SUMMARY
HR : 12 emp | Avg KPI = 83.2
Finance : 14 emp | Avg KPI = 80.9
Engineering : 14 emp | Avg KPI = 78.2
Operations : 7 emp | Avg KPI = 76.6
Marketing : 13 emp | Avg KPI = 75.1
TOP 3 PERFORMERS
#1 CO-D-EMP034 | Dept: Finance | KPI: 99.8 | Salary: IDR 12,610
#2 CO-D-EMP028 | Dept: Marketing | KPI: 99.5 | Salary: IDR 9,633
#3 CO-D-EMP040 | Dept: Finance | KPI: 99.5 | Salary: IDR 8,254
AUTOMATED REPORT COMPANY CO-E
Total Employees : 60
Avg Monthly Salary : IDR 10,629
Avg KPI Score : 77.84
Avg Perf Score : 75.60
Top Performers : 16 (26.7% of workforce)
KPI TIER BREAKDOWN
At Risk : 11 employees
Growing : 12 employees
Solid : 21 employees
Elite : 16 employees
DEPARTMENT SUMMARY
Finance : 15 emp | Avg KPI = 80.3
Marketing : 12 emp | Avg KPI = 79.8
Engineering : 11 emp | Avg KPI = 79.0
HR : 9 emp | Avg KPI = 74.8
Operations : 13 emp | Avg KPI = 74.3
TOP 3 PERFORMERS
#1 CO-E-EMP022 | Dept: Operations | KPI: 100.0 | Salary: IDR 13,084
#2 CO-E-EMP001 | Dept: Finance | KPI: 99.4 | Salary: IDR 14,914
#3 CO-E-EMP024 | Dept: Finance | KPI: 99.3 | Salary: IDR 15,818
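The printed reports above go to the console only. One way to also keep them as text files is to wrap each call in capture.output(). The following is a minimal sketch under stated assumptions: toy_df and toy_report() are hypothetical stand-ins for dashboard_df and generate_report(), and the temporary output directory and file names are illustrative choices, not part of the original pipeline.

```r
# Toy data standing in for dashboard_df (values are made up for illustration)
toy_df <- data.frame(
  company_id = c("CO-A", "CO-A", "CO-B"),
  KPI_score  = c(80, 90, 70)
)

# Hypothetical stand-in for generate_report(): prints a two-line summary
toy_report <- function(cid, df) {
  cdata <- df[df$company_id == cid, ]
  cat(sprintf("REPORT: COMPANY %s\n", cid))
  cat(sprintf("Avg KPI Score : %.2f\n", mean(cdata$KPI_score)))
}

out_dir <- tempdir()          # assumption: write into a temporary directory
report_files <- character(0)  # named vector: company id -> file path

for (cid in sort(unique(toy_df$company_id))) {
  # capture.output() collects what cat() would print, one element per line
  report_lines <- capture.output(toy_report(cid, toy_df))
  path <- file.path(out_dir, paste0("report_", cid, ".txt"))
  writeLines(report_lines, path)
  report_files[cid] <- path
}
```

Because the report function only uses cat(), the same wrapper should work with generate_report() unchanged; the loop body stays identical apart from the function name.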
# ============================================================
# TASK 8 (BONUS): Automated CSV export via for loop
# ============================================================
export_list <- list()
for (cid in all_companies) {
cdata <- dashboard_df |> dplyr::filter(company_id == cid)
export_list[[cid]] <- data.frame(
company_id = cid,
n_employees = nrow(cdata),
avg_salary = round(mean(cdata$salary), 0),
avg_kpi = round(mean(cdata$KPI_score), 2),
avg_performance = round(mean(cdata$performance_score), 2),
top_performers = sum(cdata$top_performer == "Yes"),
elite_count = sum(cdata$kpi_tier == "Elite"),
at_risk_count = sum(cdata$kpi_tier == "At Risk")
)
}
export_df <- dplyr::bind_rows(export_list)
write.csv(export_df, "octavia_kpi_report.csv", row.names = FALSE)
cat("CSV exported: octavia_kpi_report.csv\n\n")
CSV exported: octavia_kpi_report.csv
kable(export_df,
caption = "Task 8 (Bonus) — Automated Export: Full Company KPI Summary") |>
kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = TRUE)
| company_id | n_employees | avg_salary | avg_kpi | avg_performance | top_performers | elite_count | at_risk_count |
|---|---|---|---|---|---|---|---|
| CO-A | 60 | 10281 | 73.80 | 78.05 | 13 | 13 | 15 |
| CO-B | 60 | 10114 | 71.11 | 77.75 | 9 | 9 | 21 |
| CO-C | 60 | 10538 | 75.46 | 79.04 | 14 | 14 | 13 |
| CO-D | 60 | 10373 | 78.98 | 77.19 | 10 | 10 | 6 |
| CO-E | 60 | 10629 | 77.84 | 75.60 | 16 | 16 | 11 |
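The export loop above builds one summary row per company by hand. As a point of comparison, the same split-summarise-combine idea can be sketched in base R with split(), lapply(), and do.call(rbind, ...). This is an illustrative alternative, not the practicum's method: toy_df and summarise_company() are hypothetical names, the values are made up, and only a subset of the exported columns is reproduced.

```r
# Toy data standing in for dashboard_df (values are made up for illustration)
toy_df <- data.frame(
  company_id = c("CO-A", "CO-A", "CO-B", "CO-B"),
  salary     = c(10000, 12000, 9000, 11000),
  KPI_score  = c(80, 90, 60, 70),
  kpi_tier   = c("Solid", "Elite", "At Risk", "Growing")
)

# Hypothetical helper: one summary row for a single company's rows
summarise_company <- function(cdata) {
  data.frame(
    company_id    = cdata$company_id[1],
    n_employees   = nrow(cdata),
    avg_salary    = round(mean(cdata$salary), 0),
    avg_kpi       = round(mean(cdata$KPI_score), 2),
    elite_count   = sum(cdata$kpi_tier == "Elite"),
    at_risk_count = sum(cdata$kpi_tier == "At Risk")
  )
}

# split() makes one data frame per company; lapply() summarises each one
per_company <- lapply(split(toy_df, toy_df$company_id), summarise_company)

# Stack the one-row data frames into a single export table
toy_export <- do.call(rbind, per_company)
rownames(toy_export) <- NULL
print(toy_export)
```

The explicit for loop is arguably easier to read for a practicum on loops; the functional form mainly removes the need to manage export_list manually.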
# ============================================================
# TASK 8 (BONUS): FOR LOOP — auto-build one plot per company
# gridExtra assembles them into a single dashboard figure
# ============================================================
plot_list <- list()
for (cid in all_companies) {
cdata <- dashboard_df |> dplyr::filter(company_id == cid)
p <- ggplot(cdata, aes(x = kpi_tier, fill = kpi_tier)) +
geom_bar(alpha = 0.88, show.legend = FALSE) +
geom_text(stat = "count", aes(label = after_stat(count)),
vjust = -0.4, fontface = "bold", size = 3.8) +
scale_fill_manual(values = kpi_palette) +
scale_y_continuous(limits = c(0, 30)) +
labs(title = paste0("Company ", cid), x = NULL, y = "Employees") +
theme_minimal(base_size = 10) +
theme(
plot.title = element_text(face = "bold", color = "#3D2B6B", size = 11),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
plot_list[[cid]] <- p
}
grid.arrange(grobs = plot_list, nrow = 2,
top = "Task 8 (Bonus): Auto-Generated KPI Tier Chart per Company")
The generate_report() function encapsulates all company-level logic, and a for loop executes it for every company without any copy-paste repetition. The CSV export and auto-generated plot grid show that the same pattern extends to additional companies with little or no change to the code. In real-world analytics, a data team could refresh an entire portfolio-level HR report by re-running one loop, which is a hallmark of a reproducible data science workflow.
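One detail that does not adapt automatically is the hard-coded nrow = 2 in the grid.arrange() call. A small helper could derive the layout from the number of plots instead. The sketch below is an assumption for illustration: grid_dims() is a hypothetical function, and the near-square layout rule is one reasonable choice among several.

```r
# Hypothetical helper: choose a near-square grid for n_plots panels
grid_dims <- function(n_plots) {
  stopifnot(n_plots >= 1)
  n_col <- ceiling(sqrt(n_plots))     # columns: smallest integer >= sqrt(n)
  n_row <- ceiling(n_plots / n_col)   # rows: just enough to hold every plot
  c(nrow = n_row, ncol = n_col)
}

dims_five   <- grid_dims(5)   # five companies, as in this practicum
dims_twelve <- grid_dims(12)  # a larger hypothetical portfolio
print(dims_five)
print(dims_twelve)

# Possible use with the plot list built in the loop (not run here):
# dims <- grid_dims(length(plot_list))
# grid.arrange(grobs = plot_list, nrow = dims[["nrow"]], ncol = dims[["ncol"]])
```

For five companies this yields two rows, which agrees with the value used in the original call.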
# ============================================================
# SUMMARY TABLE — All 8 tasks in one view
# ============================================================
summary_df <- data.frame(
Task = c("Task 1","Task 2","Task 3","Task 4",
"Task 5","Task 6","Task 7","Task 8"),
Concept = c(
"Dynamic Function + Input Validation",
"Nested Simulation + Discount Logic",
"5-Tier Categorization Loop",
"Multi-Company Nested Loop Generation",
"Monte Carlo Pi & Probability",
"Loop Normalization & Feature Engineering",
"KPI Dashboard — Mini Project",
"Automated Report Generation (Bonus)"
),
Key_Function = c(
"compute_formula(x, formula)",
"simulate_sales(n_sp, days) + apply_discount()",
"categorize_performance(sales_amount)",
"generate_company_data(n_co, n_emp)",
"monte_carlo_pi(n_points)",
"normalize_columns(df) + z_score(df)",
"Integrated pipeline: Tasks 4–6 combined",
"generate_report(cid, df) + for loop"
),
Visualization = c(
"Multi-line chart with log-y scale",
"Cumulative line chart per salesperson",
"Bar chart + pie chart of tier distribution",
"Scatter with OLS regression per company",
"Point plot with circle & sub-square overlay",
"Faceted histogram + violin & boxplot",
"Grouped bar + faceted scatter + area density",
"Auto text reports + grid of bar charts"
),
check.names = FALSE
)
kable(summary_df,
caption = "Practicum Week-5 — Full Task Summary",
col.names = c("Task","Concept","Key Function","Visualization")) |>
kable_styling(bootstrap_options = c("striped","hover","bordered"),
full_width = TRUE, font_size = 13)
| Task | Concept | Key Function | Visualization |
|---|---|---|---|
| Task 1 | Dynamic Function + Input Validation | compute_formula(x, formula) | Multi-line chart with log-y scale |
| Task 2 | Nested Simulation + Discount Logic | simulate_sales(n_sp, days) + apply_discount() | Cumulative line chart per salesperson |
| Task 3 | 5-Tier Categorization Loop | categorize_performance(sales_amount) | Bar chart + pie chart of tier distribution |
| Task 4 | Multi-Company Nested Loop Generation | generate_company_data(n_co, n_emp) | Scatter with OLS regression per company |
| Task 5 | Monte Carlo Pi & Probability | monte_carlo_pi(n_points) | Point plot with circle & sub-square overlay |
| Task 6 | Loop Normalization & Feature Engineering | normalize_columns(df) + z_score(df) | Faceted histogram + violin & boxplot |
| Task 7 | KPI Dashboard — Mini Project | Integrated pipeline: Tasks 4–6 combined | Grouped bar + faceted scatter + area density |
| Task 8 | Automated Report Generation (Bonus) | generate_report(cid, df) + for loop | Auto text reports + grid of bar charts |
Practicum Week-5 has successfully demonstrated the full integration of functions, loops, simulation, and ggplot2 visualization in R. Eight core conclusions emerge:
apply_discount() handles business rules while simulate_sales() manages the loop structure.
Together, these eight tasks confirm that reusable functions, structured loops, expressive visualization, and automated reporting form a core toolkit for professional data science work in R.