Assignment Week 5

Data Science Programming 1

Data Science Study Program Institut Teknologi Sains Bandung
Student
Nadia Apriani
52250006
Student Major in Data Science
R Programming Data Science Statistics
Lecturer
Bakti Siregar
Bakti Siregar, M.Sc., CDS
 
Lecturer in Data Science
& Statistical Computing
Data Science Statistics

1 Task 1: Dynamic Multi-Formula Function


1.1 Introduction

This document explains the implementation of the function compute_formula(x, formula), which calculates four mathematical formulas dynamically:

Formula Equation
Linear \(y = 2x + 1\)
Quadratic \(y = x^2 + 3x + 2\)
Cubic \(y = x^3 - 2x^2 + x\)
Exponential \(y = e^{0.3x}\)

The reason for choosing the range \(x = 1\) to \(20\):

This range was chosen because it is long enough to clearly show the difference in growth between the four formulas, especially the separation between polynomial (cubic) and exponential growth, which starts to become significant above \(x = 10\).


1.2 Part 1 - function compute_formula()

This function accepts two parameters:

  • x: Numerical value (can be a vector)
  • formula: The name of the formula to calculate, given as a string

Inside the function, there are two layers of input validation, followed by conditional logic (if-else) to choose the appropriate formula.

# ============================================================
# MAIN FUNCTION: compute_formula(x, formula)
# Wraps the entire logic for formula selection and calculation
# ============================================================

compute_formula <- function(x, formula) {

  # --- VALIDATION 1: Ensure x is numeric ---
  # If x is a character or factor, stop execution
  if (!is.numeric(x)) {
    stop("Error: Argument 'x' must be a numeric value.")
  }

  # --- VALIDATION 2: Ensure formula name is recognized ---
  # List of available formulas in this function
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")

  # Normalize input: convert to lowercase & remove extra spaces
  formula <- tolower(trimws(formula))

  # Check if the formula is in the valid list
  if (!(formula %in% valid_formulas)) {
    stop(paste("Unknown formula:", formula))
  }

  # --- CONDITIONAL LOGIC: Select and calculate formula ---
  # Using if-else to determine the formula based on input 'formula'

  if (formula == "linear") {
    # Linear formula: y = 2x + 1
    # Constant growth — increases by 2 for every 1 unit increase in x
    result <- 2 * x + 1

  } else if (formula == "quadratic") {
    # Quadratic formula: y = x^2 + 3x + 2
    # Growth accelerates as x increases (parabola)
    result <- x^2 + 3 * x + 2

  } else if (formula == "cubic") {
    # Cubic formula: y = x^3 - 2x^2 + x
    # Very rapid growth — dominant for large x
    result <- x^3 - 2 * x^2 + x

  } else if (formula == "exponential") {
    # Exponential formula: y = e^(0.3x)
    # Compounded growth — the fastest among all formulas
    result <- exp(0.3 * x)
  }

  # Return the calculation result
  return(result)
}
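
As an alternative to the if-else chain, the same selection step can be sketched with a named list of functions. This is only an illustration under stated assumptions: the names formula_table and compute_formula_lookup are hypothetical and not part of the assignment code, and the four expressions are the ones listed in Section 1.1.

```r
# Hypothetical lookup table: one small function per formula name
formula_table <- list(
  linear      = function(x) 2 * x + 1,
  quadratic   = function(x) x^2 + 3 * x + 2,
  cubic       = function(x) x^3 - 2 * x^2 + x,
  exponential = function(x) exp(0.3 * x)
)

# Hypothetical variant of compute_formula() that dispatches via the list
compute_formula_lookup <- function(x, formula) {
  if (!is.numeric(x)) stop("Argument 'x' must be a numeric value.")
  formula <- tolower(trimws(formula))
  if (!(formula %in% names(formula_table))) {
    stop(paste("Unknown formula:", formula))
  }
  # Pick the matching function and apply it to x (also works on vectors)
  formula_table[[formula]](x)
}

compute_formula_lookup(c(1, 2, 3), " Quadratic ")
```

With this layout, adding a fifth formula would only require one new entry in the list, since the validation reads the valid names from the list itself.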

1.3 Part 2 - Nested Loops to Compute All Formulas

Nested loops are used so that every combination of the \(x\) values (20 values) and formula types (4 formulas) is processed automatically in a single run, without having to rewrite the code for each combination.

# ============================================================
# NESTED LOOPS: Iterate x and formulas
# ============================================================

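# Helper function for calculation
# NOTE: this redefinition differs from Part 1: the cubic subtracts 1 and the
# exponential is 2^x, so the table below reflects these variants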
compute_formula <- function(x, type) {
  type <- tolower(type)
  if (!is.numeric(x)) stop("x must be numeric")
  valid <- c("linear", "quadratic", "cubic", "exponential")
  if (!(type %in% valid)) stop(paste("Unknown formula:", type))
  
  switch(type,
    "linear"      = 2 * x + 1,
    "quadratic"   = x^2 + 3*x + 2,
    "cubic"       = x^3 - 2*x^2 + x - 1,
    "exponential" = 2^x
  )
}

# Define inputs
x_values <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

# Initialize results dataframe
results <- data.frame(
  x       = integer(0),
  formula = character(0),
  y       = numeric(0)
)

# --- Nested Loop Execution ---
for (x_val in x_values) {          # Loop through x values
  for (frm in formulas) {          # Loop through formulas
    y_val <- compute_formula(x_val, frm)
    results <- rbind(
      results,
      data.frame(x = x_val, formula = frm, y = y_val)
    )
  }
}

# ============================================================
# OUTPUT: Wide format table
# ============================================================
library(tidyr)
library(knitr)

# Pivot data to wide format
results_wide <- pivot_wider(
  results,
  names_from  = formula,
  values_from = y
)

# Set clean column names
colnames(results_wide) <- c("x", "Linear", "Quadratic", "Cubic", "Exponential")

# Render table
kable(results_wide, format = "simple", align = "r",
      caption = "Comparison of y Values for Each Formula")
Comparison of y Values for Each Formula
x Linear Quadratic Cubic Exponential
1 3 6 -1 2
2 5 12 1 4
3 7 20 11 8
4 9 30 35 16
5 11 42 79 32
6 13 56 149 64
7 15 72 251 128
8 17 90 391 256
9 19 110 575 512
10 21 132 809 1024
11 23 156 1099 2048
12 25 182 1451 4096
13 27 210 1871 8192
14 29 240 2365 16384
15 31 272 2939 32768
16 33 306 3599 65536
17 35 342 4351 131072
18 37 380 5201 262144
19 39 420 6155 524288
20 41 462 7219 1048576
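
Growing a data frame with rbind() inside a loop copies the accumulated rows on every iteration. As a hedged alternative, the same wide table can be sketched without explicit loops. The helper name formula_value is hypothetical; it uses the same four expressions as the Part 2 helper above.

```r
# Hypothetical stand-in for the Part 2 helper (same four expressions)
formula_value <- function(x, type) {
  switch(type,
    linear      = 2 * x + 1,
    quadratic   = x^2 + 3 * x + 2,
    cubic       = x^3 - 2 * x^2 + x - 1,
    exponential = 2^x,
    stop(paste("Unknown formula:", type))
  )
}

x_values <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

# sapply() evaluates each formula on the whole x vector at once and
# returns a 20 x 4 matrix whose columns are named after the formulas
wide <- data.frame(
  x = x_values,
  sapply(formulas, function(f) formula_value(x_values, f))
)

head(wide)
```

Because each formula is vectorised over x, the inner loop over x values is not needed in this variant; the nested-loop version remains the one required by the task.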

1.4 Part 3 — Visualization: Plot All Formulas in One Graph

For this visualization, I use the plotly library so that the resulting graphics are interactive. The benefit is that we can read the exact value of every point simply by hovering the cursor over it, which makes it easy to see how drastically the linear and exponential formulas differ at high values of \(x\).

Note: The cubic and exponential values grow very quickly once \(x > 10\), so the Y axis is limited to 8000 to keep all four formulas readable in one graph.
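
The plotting code itself is not shown in this report. A minimal sketch of how such a chart could be built with plotly is given below. It assumes the four formulas from Section 1.1; the object names plot_data and fig are hypothetical, and the original figure may have been produced differently.

```r
x_values <- 1:20

# Long-format data: one row per (x, formula) pair
plot_data <- data.frame(
  x       = rep(x_values, times = 4),
  formula = rep(c("Linear", "Quadratic", "Cubic", "Exponential"),
                each = length(x_values)),
  y       = c(2 * x_values + 1,
              x_values^2 + 3 * x_values + 2,
              x_values^3 - 2 * x_values^2 + x_values,
              exp(0.3 * x_values))
)

# Build the chart only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(plot_data, x = ~x, y = ~y, color = ~formula,
                         type = "scatter", mode = "lines+markers")
  # Cap the Y axis at 8000 so the slower-growing formulas stay visible
  fig <- plotly::layout(fig,
                        xaxis = list(title = "x"),
                        yaxis = list(title = "y", range = c(0, 8000)))
  # print(fig) displays the interactive chart
}
```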

Interpretation

This visualization clearly illustrates the contrasting growth rates of the four mathematical functions, where a significant divergence emerges once \(x\) exceeds 10. The Cubic function (\(y = x^3 - 2x^2 + x\)) demonstrates the most aggressive surge, dominating the graph by surpassing 7,000, while the Exponential function also shows rapid acceleration. In contrast, the Linear and Quadratic models remain relatively flat and stable at the bottom, underscoring how significantly higher-degree polynomials and exponential growth outperform simpler models as the input value increases; capping the Y-axis at 8,000 was essential to maintain a proportional view and highlight these drastic differences across the \(x=1\) to \(20\) range.


1.5 Part 4 - Test Input Validation

Testing covers four scenarios and uses tryCatch() so that the program doesn't stop when it encounters an error.

# ============================================================
# VALIDATION TEST: Ensuring the function handles incorrect input
# ============================================================

cat(
  # Test 1: Unknown formula
  tryCatch(compute_formula(5, "logarithmic"), error = function(e) e$message),
  
  # Test 2: x is not numeric
  tryCatch(compute_formula("ten", "linear"), error = function(e) e$message),
  
  # Test 3: Capitalized input (case-insensitive)
  compute_formula(5, "LINEAR"),
  
  # Test 4: x as a vector (vectorized)
  compute_formula(c(1, 5, 10), "quadratic"),
  
  sep = "\n"
)
## Unknown formula: logarithmic
## x must be numeric
## 11
## 6
## 42
## 132

2 Task 2: Nested Simulation — Multi-Sales & Discounts

2.1 Initialisation and Function Preparation

The first step is to set the seed so that the random numbers produced are always consistent. Then, we define the main function simulate_sales, which contains two nested functions to calculate the sales accumulation (calc_cumulative) and the discount logic (get_discount).

# Initial setup to ensure consistent simulation results
set.seed(42)

simulate_sales <- function(n_salesperson, days) {

  # NESTED FUNCTION 1: Calculating daily accumulation (Loop)
  calc_cumulative <- function(sales_vec) {
    cum_total <- 0
    result    <- numeric(length(sales_vec))
    for (i in seq_along(sales_vec)) {
      cum_total  <- cum_total + sales_vec[i]
      result[i]  <- cum_total
    }
    return(result)
  }

  # NESTED FUNCTION 2: Discount logic (Conditional Logic)
  get_discount <- function(amount) {
    if      (amount >= 900) return(0.20)  # 20% Discount
    else if (amount >= 600) return(0.15)  # 15% Discount
    else if (amount >= 300) return(0.10)  # 10% Discount
    else                    return(0.05)  # 5% Discount
  }

  all_data <- data.frame()

  # OUTER LOOP: Iteration per salesperson
  for (sp in 1:n_salesperson) {
    sales_vec  <- numeric(days)
    disc_vec   <- numeric(days)

    # INNER LOOP: Daily iteration
    for (d in 1:days) {
      amt          <- round(runif(1, min = 100, max = 1000), 2)
      sales_vec[d] <- amt
      disc_vec[d]  <- get_discount(amt)
    }

    cum_vec <- calc_cumulative(sales_vec)

    # Storing results into a Data Frame
    sp_df <- data.frame(
      sales_id      = paste0("SP", sp),
      day           = 1:days,
      sales_amount  = sales_vec,
      discount_rate = disc_vec,
      cumulative    = cum_vec
    )
    all_data <- rbind(all_data, sp_df)
  }
  return(all_data)
}

2.2 Simulation Execution and Data Snapshot

After the function is ready, we run a simulation for 3 salespeople over 30 days. The result of this function is the raw data that we will use for further analysis. Here are the first 6 rows of the data that were successfully generated:

## First 6 Rows of Simulation Result Data:
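
The call that produces this snapshot is not shown in the report. The sketch below illustrates what such a run could look like. To keep it self-contained it uses a condensed, vectorised stand-in named simulate_sales_sketch (a hypothetical name) in place of simulate_sales: cumsum() replaces the accumulation loop and findInterval() replaces the if-else discount chain. The result is stored in df, the name used later in Task 3.

```r
# Condensed stand-in for simulate_sales() (hypothetical name)
simulate_sales_sketch <- function(n_salesperson, days) {
  per_person <- lapply(seq_len(n_salesperson), function(sp) {
    amt <- round(runif(days, min = 100, max = 1000), 2)
    # findInterval() returns 0..3 depending on how many thresholds are reached
    tier <- findInterval(amt, c(300, 600, 900))
    data.frame(
      sales_id      = paste0("SP", sp),
      day           = seq_len(days),
      sales_amount  = amt,
      discount_rate = c(0.05, 0.10, 0.15, 0.20)[tier + 1],
      cumulative    = cumsum(amt)
    )
  })
  do.call(rbind, per_person)
}

set.seed(42)
df <- simulate_sales_sketch(n_salesperson = 3, days = 30)
head(df)   # first 6 rows
```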

2.3 Cumulative Sales Trend Visualization

The graph below shows how each salesperson's total sales increase every day. This visualization makes it easier to see who has the sharpest growth.

Interpretation

This visualization tracks the cumulative sales performance over a 30-day period. All three salespeople show consistent positive growth, with a clear divergence in performance becoming visible after the second week.

SP1 emerges as the top performer, maintaining the steepest growth curve to finish with the highest total. SP2 follows with competitive results, while SP3 shows a steadier, more gradual increase. Overall, this chart successfully validates our simulation’s ability to model distinct individual sales dynamics.

3 Task 3: Multi-Level Function Performance

3.1 Initialization and Function Preparation

First, we define the function categorize_performance, which accepts a vector of sales values and returns the performance category for each element. Inside the main function, there is a nested function get_category that handles the category logic based on the sales value.

# Main function: accepts a vector, returns a category vector
categorize_performance <- function(sales_vec) {
  # NESTED FUNCTION: Category logic for a single value (Conditional Logic)
  get_category <- function(val) {
    if      (val >= 900) return("Excellent")
    else if (val >= 700) return("Very Good")
    else if (val >= 500) return("Good")
    else if (val >= 300) return("Average")
    else                 return("Poor")
  }
  
  categories <- character(length(sales_vec))
  
  # LOOP: Iteration through each element of the sales vector
  for (i in seq_along(sales_vec)) {
    categories[i] <- get_category(sales_vec[i])
  }
  
  return(categories)
}
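
For comparison, the same thresholds can be expressed without an explicit loop by using base R's cut(). This is a sketch only; the function name categorize_with_cut is hypothetical and not part of the assignment code.

```r
# Hypothetical vectorised variant using cut()
categorize_with_cut <- function(sales_vec) {
  as.character(cut(
    sales_vec,
    breaks = c(-Inf, 300, 500, 700, 900, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE   # intervals are [lower, upper), so 900 counts as "Excellent"
  ))
}

categorize_with_cut(c(150, 300, 699.99, 700, 900))
# "Poor" "Average" "Good" "Very Good" "Excellent"
```

Setting right = FALSE matters here: it makes each boundary value fall into the higher category, which matches the >= comparisons in get_category.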

3.2 Execution and Data Snapshot

We use the sales data (sales_amount) from the Task 2 simulation. The function categorize_performance is applied to that column and the result is saved as a new column, performance. Here are the first 6 rows:

# Data df is already available from Task 2 — simply apply the categorization function
df$performance <- categorize_performance(df$sales_amount)
head(df)

3.3 Summary Statistics by Performance Category

The following table summarizes the number of transactions and the percentage for every performance category, so that the data distribution can be read concisely.

Performance Category Summary
Category Count Percentage
Excellent 14 15.6%
Very Good 20 22.2%
Good 19 21.1%
Average 17 18.9%
Poor 20 22.2%
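
The code that builds this summary is not shown. One way to produce a table of this shape in base R is sketched below. The vector performance is a small hypothetical sample standing in for df$performance, so its counts differ from the table above.

```r
# Hypothetical category vector standing in for df$performance
performance <- c("Good", "Poor", "Excellent", "Good", "Average",
                 "Very Good", "Poor", "Good", "Very Good", "Poor")

# Fix the display order from best to worst
levels_order <- c("Excellent", "Very Good", "Good", "Average", "Poor")
counts <- table(factor(performance, levels = levels_order))

summary_tbl <- data.frame(
  Category   = names(counts),
  Count      = as.integer(counts),
  Percentage = sprintf("%.1f%%", 100 * as.integer(counts) / length(performance))
)
summary_tbl
```

Converting to a factor with explicit levels keeps the rows in the intended order and would also keep a category in the table when its count is zero.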

3.4 Visualization: Bar Chart and Pie Chart of Category Distribution

The following two graphs show the distribution of performance categories. The bar chart shows the count and percentage of each category, while the pie chart shows the proportions visually.

Interpretation

This visualization highlights a balanced distribution across five performance categories. The ‘Very Good’ and ‘Poor’ segments lead the chart, each accounting for 20 transactions (22.2%). In contrast, the ‘Excellent’ category shows the lowest frequency with 14 transactions (15.6%). These results indicate a significant opportunity to optimize lower-performing transactions and shift them toward higher performance levels.

4 Task 4: Multi-Company Dataset Simulation

4.1 Function Definition

Here, we build the core “engine” through the generate_company_data function. This function is designed with nested loops: the outer loop handles company identities, while the inner loop generates data for each individual employee. It also includes conditional logic to flag employees with a KPI over 90 as “Top Performers.”

generate_company_data <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  all_data <- list()

  for (comp in 1:n_company) {
    company_id <- paste0("Company_", LETTERS[comp])
    company_records <- list()

    for (emp in 1:n_employees) {
      employee_id       <- paste0(company_id, "_EMP", sprintf("%03d", emp))
      salary            <- round(runif(1, min = 3000, max = 15000), 2)
      department        <- sample(departments, 1)
      performance_score <- round(runif(1, min = 50, max = 100), 1)
      KPI_score         <- round(runif(1, min = 40, max = 100), 1)

      top_performer <- ifelse(KPI_score > 90, "Yes", "No")

      company_records[[emp]] <- data.frame(
        company_id        = company_id,
        employee_id       = employee_id,
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
    }
    all_data[[comp]] <- do.call(rbind, company_records)
  }

  final_df <- do.call(rbind, all_data)
  rownames(final_df) <- NULL
  return(final_df)
}

# Generate 5 companies x 100 employees = 500 rows total
employee_data <- generate_company_data(n_company = 5, n_employees = 100)

4.2 Data Summarization

After creating the primary dataset, we need to look at the big picture. Using the summarise function from dplyr, we calculate the average salary, average performance, maximum KPI, and total top performers for each company. These results are stored in the company_summary table, which serves as the basis for our upcoming visualizations.

Company Performance Summary
company_id avg_salary avg_performance max_KPI top_performers
Company_A 8633.96 74.39 99.4 20
Company_B 8344.15 74.04 99.5 23
Company_C 8824.15 74.95 98.2 15
Company_D 9178.27 73.30 100.0 14
Company_E 8935.53 76.11 100.0 20
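
The summarisation code is not included in this section. A minimal sketch of the dplyr pipeline that would yield these columns is shown below. The data frame employee_sample is a tiny hypothetical sample with the same column names as employee_data, so the numbers are illustrative only.

```r
library(dplyr)

# Small hypothetical sample with the same columns as employee_data
employee_sample <- data.frame(
  company_id        = c("Company_A", "Company_A", "Company_B", "Company_B"),
  salary            = c(5000, 7000, 9000, 11000),
  performance_score = c(60, 80, 70, 90),
  KPI_score         = c(95, 85, 50, 91),
  top_performer     = c("Yes", "No", "No", "Yes")
)

# One summary row per company
company_summary <- employee_sample %>%
  group_by(company_id) %>%
  summarise(
    avg_salary      = round(mean(salary), 2),
    avg_performance = round(mean(performance_score), 2),
    max_KPI         = max(KPI_score),
    top_performers  = sum(top_performer == "Yes"),
    .groups = "drop"
  )
company_summary
```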

4.3 Visualizations

4.3.1 Bar Chart: Salary Comparison

This sub-heading focuses on comparing the average salary across different companies. By using plot_ly, we create a bar chart equipped with informative tooltips. This feature allows the audience to see details such as company name, average salary, and the count of top performers simply by hovering over the bars.

4.3.2 Boxplot: KPI Distribution

This section presents the KPI score distribution using a boxplot. This visualization is crucial for observing data spread, medians, and outliers. We also added a red threshold line at 90 to provide a clear visual indication of where the “Top Performer” category begins.
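
The boxplot code is not shown here. As a hedged illustration, a threshold line can be added to a plotly boxplot through the shapes entry of layout(). The data frame kpi_data is hypothetical random data, and the object names are assumptions for this sketch.

```r
set.seed(1)

# Hypothetical KPI data: 3 companies, 50 employees each
kpi_data <- data.frame(
  company_id = rep(paste0("Company_", LETTERS[1:3]), each = 50),
  KPI_score  = round(runif(150, min = 40, max = 100), 1)
)

# Horizontal line at the Top Performer threshold; xref = "paper"
# makes x0/x1 span the full plot width regardless of the categories
threshold_line <- list(
  type = "line", xref = "paper", x0 = 0, x1 = 1,
  y0 = 90, y1 = 90,
  line = list(color = "red", dash = "dash")
)

# Build the chart only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(kpi_data, x = ~company_id, y = ~KPI_score, type = "box")
  fig <- plotly::layout(fig,
                        shapes = list(threshold_line),
                        yaxis  = list(title = "KPI score"))
  # print(fig) displays the interactive chart
}
```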

Interpretation

  1. Average Salary Comparison

This visualization compares the average financial compensation provided by five different companies. The primary focus is to observe the variance in wage standards across entities and how these figures correlate with the number of high-achieving employees (Top Performers) within each organization.

  2. KPI Distribution & Top Performer Threshold

This chart maps the distribution of employee KPI scores to measure consistency and collective performance quality. The use of a threshold line at 90 serves as a critical indicator to distinguish the general employee population from those with exceptional productivity in each company.

5 Task 5: Monte Carlo Simulation: Pi & Probability

5.1 Task Description and Mathematical Concept

This task uses the Monte Carlo method to estimate the value of \(\pi\) and analyze probabilities. The core concept is placing a unit circle inside a \(2 \times 2\) square. The ratio of the circle’s area to the square’s area is \(\pi/4\). By generating random points, we can calculate: \[\pi \approx 4 \times \frac{\text{points inside circle}}{\text{total points}}\]

5.2 Integrated Function Definition

In this section, we build the monte_carlo_pi function, which integrates the entire simulation workflow. The function operates as follows:

  • Data Generation: Uses runif to create random \((x, y)\) coordinates.
  • Geometric Logic: Utilizes the Pythagorean theorem to filter points located inside the circle (\(x^2 + y^2 \leq 1\)).
  • Iteration & Analysis: Employs loops to monitor accuracy changes (convergence) and calculate probabilities within specific sub-areas.
  • Visualization: Generates two plots simultaneously: the spatial distribution of points and the stability trend of the \(\pi\) value as the sample size increases.
monte_carlo_pi <- function(n_points) {
  
  # --- STEP 1: Generate Points ---
  set.seed(42)
  x <- runif(n_points, min = -1, max = 1)
  y <- runif(n_points, min = -1, max = 1)
  
  # --- STEP 2: Geometric Logic ---
  distance_sq <- x^2 + y^2
  inside_circle <- distance_sq <= 1
  
  # --- STEP 3: Pi Estimation ---
  pi_estimate <- 4 * sum(inside_circle) / n_points
  
  # --- STEP 4: Convergence Loop ---
  cat("=== π Convergence Over Iterations ===\n")
  checkpoints <- c(100, 500, 1000, 5000, n_points)
  for (n in checkpoints) {
    if (n <= n_points) {
      in_circle_n <- sum(x[1:n]^2 + y[1:n]^2 <= 1)
      pi_n        <- 4 * in_circle_n / n
      error_n     <- abs(pi_n - pi)
      cat(sprintf("  n = %6d | Estimate = %.6f | Error = %.6f\n", n, pi_n, error_n))
    }
  }
  
  # --- STEP 5: Probability Analysis ---
  in_subsquare     <- (x >= 0 & x <= 0.5) & (y >= 0 & y <= 0.5)
  prob_empirical   <- sum(in_subsquare) / n_points
  prob_theoretical <- (0.5 * 0.5) / (2 * 2)
  cat("\n=== Sub-square Probability [0, 0.5] ===\n")
  cat(sprintf("  Theoretical: %.6f | Empirical: %.6f\n", prob_theoretical, prob_empirical))
  
  # --- STEP 6: Build Plotly Objects ---
  plot_n       <- min(n_points, 3000)
  point_colors <- ifelse(inside_circle[1:plot_n], "#2196F3", "#F44336")
  point_labels <- ifelse(inside_circle[1:plot_n], "Inside Circle", "Outside Circle")
  
  theta    <- seq(0, 2 * pi, length.out = 500)
  circle_x <- cos(theta)
  circle_y <- sin(theta)
  
  p1 <- plot_ly() %>%
    add_trace(
      x = x[1:plot_n], y = y[1:plot_n],
      type = "scatter", mode = "markers",
      marker = list(color = point_colors, size = 3, opacity = 0.7),
      text = paste0(point_labels,
                    "<br>x = ", round(x[1:plot_n], 4),
                    "<br>y = ", round(y[1:plot_n], 4)),
      hoverinfo = "text",
      name = "Sample Points"
    ) %>%
    add_trace(
      x = circle_x, y = circle_y,
      type = "scatter", mode = "lines",
      line = list(color = "black", width = 2),
      hoverinfo = "none",
      name = "Unit Circle"
    ) %>%
    add_trace(
      x = c(0, 0.5, 0.5, 0, 0),
      y = c(0, 0, 0.5, 0.5, 0),
      type = "scatter", mode = "lines",
      line = list(color = "#FF9800", width = 2, dash = "dash"),
      hoverinfo = "none",
      name = "Sub-square [0–0.5]"
    ) %>%
    layout(
      title  = list(text = paste0("<b>Spatial Distribution</b> (n = ", n_points, ")"),
                    font = list(size = 14)),
      xaxis  = list(title = "X", range = c(-1.1, 1.1), scaleanchor = "y"),
      yaxis  = list(title = "Y", range = c(-1.1, 1.1)),
      legend = list(orientation = "h", y = -0.15),
      hovermode = "closest"
    )
  
  iter_seq     <- round(seq(100, n_points, length.out = 200))
  pi_estimates <- numeric(length(iter_seq))
  for (i in seq_along(iter_seq)) {
    n_i            <- iter_seq[i]
    in_c_i         <- sum(x[1:n_i]^2 + y[1:n_i]^2 <= 1)
    pi_estimates[i] <- 4 * in_c_i / n_i
  }
  errors <- abs(pi_estimates - pi)
  
  p2 <- plot_ly() %>%
    add_trace(
      x = iter_seq, y = pi_estimates,
      type = "scatter", mode = "lines",
      line = list(color = "#3F51B5", width = 1.5),
      text = paste0("n = ", iter_seq,
                    "<br>π̂ = ", round(pi_estimates, 6),
                    "<br>Error = ", round(errors, 6)),
      hoverinfo = "text",
      name = "Estimated π"
    ) %>%
    add_trace(
      x = c(min(iter_seq), max(iter_seq)),
      y = c(pi, pi),
      type = "scatter", mode = "lines",
      line = list(color = "#E53935", width = 2, dash = "dash"),
      hoverinfo = "none",
      name = paste0("True π = ", round(pi, 6))
    ) %>%
    layout(
      title     = list(text = "<b>Stability Trend of π</b>", font = list(size = 14)),
      xaxis     = list(title = "Sample Size (n)"),
      yaxis     = list(title = "Estimated π", range = c(pi - 0.3, pi + 0.3)),
      legend    = list(orientation = "h", y = -0.15),
      hovermode = "x unified"
    )
  
  invisible(list(
    pi_estimate    = pi_estimate,
    prob_empirical = prob_empirical,
    plot_spatial   = p1,        
    plot_converge  = p2
  ))
}

5.3 Execution and Result Visualization

The final step is to execute the simulation function with 10,000 points and display the two visualization results. Note that the function returns plotly objects, so base-graphics settings such as par(mfrow) do not affect how they are laid out.

## === π Convergence Over Iterations ===
##   n =    100 | Estimate = 2.920000 | Error = 0.221593
##   n =    500 | Estimate = 3.056000 | Error = 0.085593
##   n =   1000 | Estimate = 3.064000 | Error = 0.077593
##   n =   5000 | Estimate = 3.112800 | Error = 0.028793
##   n =  10000 | Estimate = 3.127200 | Error = 0.014393
## 
## === Sub-square Probability [0, 0.5] ===
##   Theoretical: 0.062500 | Empirical: 0.062900

Interpretation

  1. Spatial Distribution Analysis

This plot displays the first 3,000 of the 10,000 random points (the function caps the plotted sample at 3,000) and demonstrates the geometric estimation of \(\pi\). The share of blue points (inside the unit circle) among all points approximates \(\pi/4\). The orange dashed sub-square serves as a validation tool, where the empirical probability of 0.0629 closely aligns with the theoretical value of 0.0625, which is consistent with the points being uniformly distributed.

  2. Stability Trend Analysis

This graph visualizes the Law of Large Numbers. At lower sample sizes, the estimated \(\pi\) (blue line) shows high volatility due to variance. However, as the sample size increases toward \(n = 10,000\), the estimation stabilizes and converges toward the true value of \(\pi\) (red dashed line). This trend confirms that the Monte Carlo method becomes significantly more reliable and precise as the number of iterations grows.
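
To put the observed errors in context, the expected size of the sampling error can be sketched with a standard binomial argument: each point lands inside the circle with probability \(\pi/4\), so the estimator \(4 \times \text{hits}/n\) has standard error \(4\sqrt{p(1-p)/n}\). The function name se_pi is hypothetical.

```r
# Standard error of the Monte Carlo estimator 4 * (hits / n)
se_pi <- function(n) {
  p <- pi / 4                    # probability of landing inside the circle
  4 * sqrt(p * (1 - p) / n)
}

checkpoints <- c(100, 500, 1000, 5000, 10000)
data.frame(n = checkpoints, std_error = round(se_pi(checkpoints), 4))
```

At \(n = 10000\) the standard error is roughly 0.016, so the reported error of 0.014393 is within one standard error. The error shrinks in proportion to \(1/\sqrt{n}\), which means that halving it requires about four times as many points.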

6 Task 6: Advanced Data Transformation & Feature Engineering

6.1 Library and dataset preparation

First, we need to load our essential libraries: dplyr for data manipulation, ggplot2 for visualization, and tidyr for tidying up data structures. We’ll also generate a synthetic employee dataset to work with.

Initial Dataset Structure
Variable Type Sample
employee_id integer 1, 2, 3
age integer 58, 22, 46
salary numeric 70000, 65200, 12700
experience integer 13, 18, 21
performance numeric 7.3, 1.4, 9.2
department character IT, IT, IT

6.2 Normalisation and Standardisation

In data analysis, it’s often necessary to bring different columns to the same scale. We’ll create two functions: Min-Max Normalization to scale values between \([0, 1]\), and Z-Score Standardization to transform the distribution to have a mean of 0 and a standard deviation of 1.

# FUNCTION 1: Min-Max Normalization [0, 1]
normalize_columns <- function(df) {
  non_num <- df[, !sapply(df, is.numeric), drop = FALSE]
  num_df  <- df[,  sapply(df, is.numeric), drop = FALSE]
  for (col in names(num_df)) {
    x             <- num_df[[col]]
    num_df[[col]] <- (x - min(x)) / (max(x) - min(x))
  }
  cbind(non_num, num_df)
}

# FUNCTION 2: Z-Score Standardization
z_score <- function(df) {
  non_num <- df[, !sapply(df, is.numeric), drop = FALSE]
  num_df  <- df[,  sapply(df, is.numeric), drop = FALSE]
  for (col in names(num_df)) {
    x             <- num_df[[col]]
    num_df[[col]] <- (x - mean(x)) / sd(x)
  }
  cbind(non_num, num_df)
}

# Apply functions
df_norm   <- normalize_columns(df)
df_zscore <- z_score(df)

# Build output tables
norm_summary <- as.data.frame(t(round(as.numeric(summary(df_norm$salary)), 6)))
colnames(norm_summary) <- c("Min", "Q1", "Median", "Mean", "Q3", "Max")

zscore_summary <- as.data.frame(t(round(as.numeric(summary(df_zscore$salary)), 6)))
colnames(zscore_summary) <- c("Min", "Q1", "Median", "Mean", "Q3", "Max")

# OUTPUT
knitr::kable(norm_summary, format = "simple", align = "c",
             caption = "Salary Statistics After Normalization")
Salary Statistics After Normalization
Min Q1 Median Mean Q3 Max
0 0.443611 0.547222 0.548767 0.6625 1
knitr::kable(zscore_summary, format = "simple", align = "c",
             caption = "Salary Statistics After Z-Score")
Salary Statistics After Z-Score
Min Q1 Median Mean Q3 Max
-3.283284 -0.629148 -0.00924 0 0.680469 2.69974
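
To make the two formulas concrete, here is a small worked example on a four-element vector. The values are hypothetical and chosen only so that the arithmetic is easy to follow.

```r
x <- c(10, 20, 30, 40)

# Min-Max: subtract the minimum, divide by the range
min_max <- (x - min(x)) / (max(x) - min(x))

# Z-Score: subtract the mean, divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

round(min_max, 3)  # 0.000 0.333 0.667 1.000
round(z, 3)        # -1.162 -0.387 0.387 1.162
```

Two caveats follow from how the functions above are written: a constant column would produce NaN, because the range or standard deviation is zero, and every numeric column is rescaled, including identifier columns such as employee_id.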

6.3 Feature Engineering

This section is where we build new columns based on business logic. We will categorize performance levels, create salary brackets, flag senior employees, and bin ages into specific groups.

# FUNCTION 3: create_features(df)
create_features <- function(df) {
  df <- df |>
    mutate(
      # Feature 1: Performance category
      performance_category = case_when(
        performance >= 8  ~ "Excellent",
        performance >= 6  ~ "Good",
        performance >= 4  ~ "Average",
        TRUE               ~ "Poor"
      ),

      # Feature 2: Salary bracket
      salary_bracket = case_when(
        salary >= 80000  ~ "High",
        salary >= 50000  ~ "Mid",
        salary >= 30000  ~ "Low",
        TRUE             ~ "Very Low"
      ),

      # Feature 3: Seniority flag
      is_senior = ifelse(experience >= 10, "Senior", "Junior"),

      # Feature 4: Age group
      age_group = cut(age,
                      breaks = c(20, 30, 40, 50, Inf),
                      labels = c("20s", "30s", "40s", "50s+"),
                      right  = FALSE)
    )
  return(df)
}

# Apply function
df_feat <- create_features(df)

# Build output tables
perf_df <- as.data.frame(table(df_feat$performance_category))
colnames(perf_df) <- c("Category", "Count")

salary_df <- as.data.frame(table(df_feat$salary_bracket))
colnames(salary_df) <- c("Bracket", "Count")

# OUTPUT
knitr::kable(perf_df, format = "simple", align = "c",
             caption = "Performance Category Distribution")
Performance Category Distribution
Category Count
Average 49
Excellent 49
Good 46
Poor 56
knitr::kable(salary_df, format = "simple", align = "c",
             caption = "Salary Bracket Distribution")
Salary Bracket Distribution
Bracket Count
High 8
Low 73
Mid 111
Very Low 8

6.4 Distribution and Skewness Analysis

We’ll use a loop to compare statistics before and after standardization, and check if any columns exhibit extreme skewness that might require further transformation.

Distribution Comparison: Before vs After Z-Score
Column Before_Mean Before_SD ZScore_Mean ZScore_SD Skewness_Note
SALARY 54189.00 15042.56 0 1 ✓ Normal: -0.11
AGE 42.23 11.41 0 1 ✓ Normal: -0.22
EXPERIENCE 19.89 10.29 0 1 ✓ Normal: -0.2
PERFORMANCE 5.60 2.60 0 1 ✓ Normal: -0.19
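
The loop behind this table is not shown. The sketch below illustrates one way it could be written. Several parts are assumptions: the skewness function uses the moment-based definition (the original may rely on a package), the data frame demo is hypothetical random data, and the cut-off of 1 for flagging a column as skewed is an illustrative choice.

```r
# Moment-based sample skewness (one common definition)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Hypothetical numeric columns standing in for the employee dataset
demo <- data.frame(
  salary     = rnorm(200, mean = 54000, sd = 15000),
  experience = rexp(200, rate = 0.1)
)

report <- data.frame()
for (col in names(demo)) {
  x  <- demo[[col]]
  z  <- (x - mean(x)) / sd(x)     # z-score version of the column
  sk <- skewness(x)
  report <- rbind(report, data.frame(
    Column      = toupper(col),
    Before_Mean = round(mean(x), 2),
    Before_SD   = round(sd(x), 2),
    ZScore_Mean = round(mean(z), 2),
    ZScore_SD   = round(sd(z), 2),
    Skewness    = round(sk, 2),
    Note        = if (abs(sk) > 1) "Skewed" else "Normal"  # assumed cut-off
  ))
}
report
```

Standardization is a linear transformation, so it changes the mean and standard deviation but leaves the skewness of each column unchanged.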

6.5 Data Visualization

Visualizations help us intuitively understand how the data has changed. We’ll create a histogram to observe the effect of the Z-Score, a boxplot to relate salary to performance, and a comparison for Min-Max normalization.

Interpretation

  1. Salary Distribution: Original vs. Z-Score

The Z-Score transformation successfully mapped salaries into a standardized scale (mean of 0, SD of 1) without altering the distribution’s shape. This is highly effective for normalizing data with wide value ranges, ensuring that machine learning models do not assign disproportionate weight to specific variables based solely on their scale.

  2. Salary per Performance Category

This chart shows a relatively consistent salary variation across all performance categories. However, the presence of an outlier in the “Excellent” group reveals a data anomaly: some high-performing employees are receiving compensation below the average. This indicates that performance is not the sole primary determinant of salary levels in this dataset.

  3. Age Distribution: Original vs. Min-Max

The Min-Max Normalization method successfully mapped age data into a range of \([0, 1]\) while maintaining the integrity of the original frequency distribution. This step is crucial for ensuring that data processing algorithms can identify age patterns more stably and efficiently, without being affected by differences in numerical units.

7 Task 7: Mini Project: Company KPI Dashboard & Simulation

7.1 Environment Setup & Data Generation

The initial stage begins by loading the necessary libraries for data processing and visualization. A simulation function named generate_company_data is defined to create a synthetic dataset. This function implements nested loops to generate profiles for 150 employees in each of 5 companies, as set in the call below, resulting in a total of 750 rows of data automatically.

library(dplyr)
library(plotly)

# Set seed for reproducibility
set.seed(42)

# Function to generate simulated multi-company data
generate_company_data <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  all_data    <- list()
  
  for (comp in 1:n_company) {
    company_id      <- paste0("Company_", LETTERS[comp])
    company_records <- list()
    
    for (emp in 1:n_employees) {
      employee_id       <- paste0(company_id, "_EMP", sprintf("%03d", emp))
      salary            <- round(runif(1, min = 3000, max = 15000), 2)
      department        <- sample(departments, 1)
      performance_score <- round(runif(1, min = 50, max = 100), 1)
      KPI_score         <- round(runif(1, min = 40, max = 100), 1)
      top_performer     <- ifelse(KPI_score > 90, "Yes", "No")
      
      company_records[[emp]] <- data.frame(
        company_id        = company_id,
        employee_id       = employee_id,
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
    }
    all_data[[comp]] <- do.call(rbind, company_records)
  }
  
  final_df           <- do.call(rbind, all_data)
  rownames(final_df) <- NULL
  return(final_df)
}

# Execute data generation
employee_data <- generate_company_data(n_company = 5, n_employees = 150)

7.2 KPI Tier Categorization

The categorization process is performed using a combination of loops and conditional logic to sort each employee into four distinct KPI tiers: Elite, High Performer, Average, and Needs Improvement. This conditional logic enables precise classification based on the previously generated KPI scores.

# Initialize character vector for tiers
kpi_tier <- character(nrow(employee_data))

# Loop through each row to determine the tier
for (i in 1:nrow(employee_data)) {
  score <- employee_data$KPI_score[i]
  
  if (score > 90) {
    kpi_tier[i] <- "Elite"
  } else if (score >= 75) {
    kpi_tier[i] <- "High Performer"
  } else if (score >= 60) {
    kpi_tier[i] <- "Average"
  } else {
    kpi_tier[i] <- "Needs Improvement"
  }
}

# Assign the new column to the main dataframe
employee_data$kpi_tier <- kpi_tier
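Since the thresholds are fixed, the same tiers can also be assigned without an explicit loop. The sketch below shows one vectorized way to do it with nested ifelse() calls. The helper name assign_kpi_tier is hypothetical and not part of the original assignment.

```r
# Hypothetical helper: vectorized version of the if/else chain above
assign_kpi_tier <- function(score) {
  ifelse(score > 90,  "Elite",
  ifelse(score >= 75, "High Performer",
  ifelse(score >= 60, "Average", "Needs Improvement")))
}

# Boundary values land in the same tiers as in the loop
assign_kpi_tier(c(90.1, 90, 75, 60, 59.9))
```

Applied to the full data, this would be employee_data$kpi_tier <- assign_kpi_tier(employee_data$KPI_score).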

7.3 Statistical Summaries

The data is then processed to generate statistical summaries at both company and department levels. Key metrics such as average salary, performance, and the number of top performers are calculated using dplyr aggregation functions to provide a comprehensive overview for management.

# Company-level summary
company_summary <- employee_data %>%
  group_by(company_id) %>%
  summarise(
    avg_salary      = round(mean(salary), 2),
    avg_KPI         = round(mean(KPI_score), 2),
    avg_performance = round(mean(performance_score), 2),
    top_performers  = sum(top_performer == "Yes"),
    elite_count     = sum(kpi_tier == "Elite"),
    total_employees = n(),
    .groups = "drop"
  )

# Department-level analysis
dept_summary <- employee_data %>%
  group_by(department) %>%
  summarise(
    avg_salary      = round(mean(salary), 2),
    avg_KPI         = round(mean(KPI_score), 2),
    total_employees = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_KPI))

7.4 Visualization

Data visualizations are developed using plotly to provide an interactive aspect to the report. This section includes a Grouped Bar Chart for salary and KPI comparison, as well as a Scatter Plot equipped with linear regression lines per company to visually analyze correlations between variables.
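The plotting code is not reproduced in this section. A minimal sketch of how the two charts could be built with plotly is given here, using a small toy data frame as a stand-in for employee_data. The object names toy, toy_summary, bar_fig, and scatter_fig are assumptions made for illustration.

```r
library(plotly)

set.seed(1)
# Toy stand-in for employee_data (hypothetical values)
toy <- data.frame(
  company_id = rep(c("Company_A", "Company_B"), each = 30),
  salary     = round(runif(60, min = 3000, max = 15000), 2),
  KPI_score  = round(runif(60, min = 40, max = 100), 1)
)

# Company-level means for the grouped bar chart
toy_summary <- aggregate(cbind(salary, KPI_score) ~ company_id,
                         data = toy, FUN = mean)

bar_fig <- plot_ly(toy_summary, x = ~company_id, y = ~salary,
                   type = "bar", name = "Avg salary") %>%
  add_trace(y = ~KPI_score, name = "Avg KPI") %>%
  layout(barmode = "group",
         xaxis = list(title = "Company"),
         yaxis = list(title = "Value"))

# Scatter plot with one fitted lm() line per company
scatter_fig <- plot_ly()
for (comp in unique(toy$company_id)) {
  comp_df        <- toy[toy$company_id == comp, ]
  comp_df        <- comp_df[order(comp_df$KPI_score), ]
  comp_df$fitted <- predict(lm(salary ~ KPI_score, data = comp_df))

  scatter_fig <- scatter_fig %>%
    add_markers(data = comp_df, x = ~KPI_score, y = ~salary, name = comp) %>%
    add_lines(data = comp_df, x = ~KPI_score, y = ~fitted,
              name = paste(comp, "trend"))
}
scatter_fig <- scatter_fig %>%
  layout(xaxis = list(title = "KPI score"),
         yaxis = list(title = "Salary"))
```

Printing bar_fig or scatter_fig in an R Markdown chunk renders the interactive chart.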

Interpretation

  1. Salary & KPI Comparison per Company

Average salaries across all companies remain relatively stable between $8,000 and $9,500, with Company_D recording the highest. However, the chart suffers from a large scale disparity: KPI values (0–100) are barely visible when plotted on the same axis as salaries in the thousands of dollars.

  2. KPI Score vs Salary Correlation

The data show no meaningful correlation between KPI score and salary, as seen in the flat trend lines and randomly scattered points. This is expected for this dataset, because generate_company_data() draws salary and KPI_score independently of each other. In a real setting, such a pattern would suggest that performance is not yet a primary factor in salary adjustments.
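The visual impression of "no correlation" can be checked numerically. The sketch below is a standalone simulation that draws salary and KPI scores independently with the same ranges as the generator; it does not reproduce the report's exact data, and the names sim, overall_r, and per_company_r are assumptions for illustration.

```r
set.seed(42)

n <- 750
# Independent draws, using the same ranges as generate_company_data()
sim <- data.frame(
  company_id = rep(paste0("Company_", LETTERS[1:5]), each = 150),
  salary     = round(runif(n, min = 3000, max = 15000), 2),
  KPI_score  = round(runif(n, min = 40, max = 100), 1)
)

# Pearson correlation over all employees
overall_r <- cor(sim$KPI_score, sim$salary)

# One correlation coefficient per company
per_company_r <- sapply(split(sim, sim$company_id),
                        function(d) cor(d$KPI_score, d$salary))

overall_r
per_company_r
```

With independent draws, the coefficients should sit close to zero; on the report's data the same calls would use employee_data in place of sim.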

8 Task 8: Automated Report Generation

8.1 Library Initialization and Data Simulation

This section loads dplyr for data manipulation, plotly for interactive charts, and htmltools for building HTML output. It also defines a function that simulates employee data for multiple companies and then assigns each employee a KPI tier.

library(dplyr)
library(plotly)
library(htmltools)

set.seed(42)

generate_company_data <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  all_data    <- list()
  
  for (comp in seq_len(n_company)) {
    company_id      <- paste0("Company_", LETTERS[comp])
    company_records <- list()
    
    for (emp in seq_len(n_employees)) {
      employee_id       <- paste0(company_id, "_EMP", sprintf("%03d", emp))
      salary            <- round(runif(1, min = 3000, max = 15000), 2)
      department        <- sample(departments, 1)
      performance_score <- round(runif(1, min = 50, max = 100), 1)
      KPI_score         <- round(runif(1, min = 40, max = 100), 1)
      top_performer     <- ifelse(KPI_score > 90, "Yes", "No")
      
      company_records[[emp]] <- data.frame(
        company_id        = company_id,
        employee_id       = employee_id,
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
    }
    all_data[[comp]] <- do.call(rbind, company_records)
  }
  
  final_df           <- do.call(rbind, all_data)
  rownames(final_df) <- NULL
  return(final_df)
}

employee_data <- generate_company_data(n_company = 5, n_employees = 150)


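# Assign KPI tiers with cut(); intervals are right-closed by default,
# e.g. (60, 75], so a score of exactly 75 is labelled "Average"
employee_data$kpi_tier <- as.character(cut(employee_data$KPI_score,
                             breaks = c(0, 60, 75, 90, 100),
                             labels = c("Needs Improvement", "Average", "High Performer", "Elite"),
                             include.lowest = TRUE))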
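One detail worth checking is how cut() treats scores that fall exactly on a break. The sketch below compares the cut() call with the threshold logic from section 7.2 on a few boundary values. The vector boundary_scores and the result names are assumptions for illustration.

```r
tier_labels     <- c("Needs Improvement", "Average", "High Performer", "Elite")
boundary_scores <- c(59.9, 60, 75, 90, 90.1)

# Same call as above: right-closed intervals [0,60], (60,75], (75,90], (90,100]
tiers_cut <- as.character(cut(boundary_scores,
                              breaks = c(0, 60, 75, 90, 100),
                              labels = tier_labels,
                              include.lowest = TRUE))

# Same thresholds as the if/else chain in section 7.2
tiers_loop <- ifelse(boundary_scores > 90,  "Elite",
              ifelse(boundary_scores >= 75, "High Performer",
              ifelse(boundary_scores >= 60, "Average", "Needs Improvement")))

data.frame(score = boundary_scores, cut = tiers_cut, if_else = tiers_loop)
```

The two approaches agree everywhere except at exactly 60 and 75, where cut() assigns the lower tier. Setting right = FALSE would not fully reconcile them either, since a score of exactly 90 would then move to Elite.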

8.2 Summary Statistics and HTML Table Builder

In this section, we create functions to calculate the company’s key statistics and a helper function to convert data frames into clean HTML table formats.

generate_summary_stats <- function(df_company) {
  list(
    total_employees = nrow(df_company),
    avg_salary      = round(mean(df_company$salary), 2),
    max_salary      = round(max(df_company$salary), 2),
    min_salary      = round(min(df_company$salary), 2),
    avg_KPI         = round(mean(df_company$KPI_score), 2),
    avg_performance = round(mean(df_company$performance_score), 2),
    top_performers  = sum(df_company$top_performer == "Yes"),
    elite_count     = sum(df_company$kpi_tier == "Elite"),
    hi_count        = sum(df_company$kpi_tier == "High Performer"),
    avg_count       = sum(df_company$kpi_tier == "Average"),
    low_count       = sum(df_company$kpi_tier == "Needs Improvement")
  )
}
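The HTML table helper mentioned in this section is not shown. A minimal sketch of what such a helper could look like with htmltools follows; the function name build_html_table and its caption argument are hypothetical, not taken from the original report.

```r
library(htmltools)

# Hypothetical helper: turn a data frame into an htmltools <table> tag
build_html_table <- function(df, caption = NULL) {
  # One <th> per column name
  header <- tags$tr(lapply(names(df), function(col) tags$th(col)))

  # One <tr> per row, one <td> per cell (values converted to text)
  rows <- lapply(seq_len(nrow(df)), function(i) {
    tags$tr(lapply(names(df), function(col) {
      tags$td(as.character(df[[col]][i]))
    }))
  })

  tags$table(
    if (!is.null(caption)) tags$caption(caption),
    tags$thead(header),
    tags$tbody(rows)
  )
}

# Usage with a small toy data frame
toy_dept <- data.frame(department = c("HR", "Finance"),
                       Employees  = c(34, 29))
html_out <- as.character(build_html_table(toy_dept, caption = "Toy summary"))
```

Returning a tag object rather than a string lets the table be combined with other htmltools elements before rendering.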

8.3 Main Execution Loop

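This is the final step, which automatically runs the entire process for every company in the dataset. The export_csv() helper shown here writes one company's data to a CSV file and returns the file path invisibly.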

export_csv <- function(df_company, comp_name, output_dir = ".") {
  file_name <- file.path(output_dir, paste0("Data_", comp_name, ".csv"))
  write.csv(df_company, file_name, row.names = FALSE)
  message("CSV exported : ", file_name)
  invisible(file_name)
}
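The loop that drives the per-company reports is not reproduced in this section. A minimal sketch of such a driver is given here, assuming it filters the data by company, computes a few statistics, and writes a CSV. The function run_all_reports and the toy data are hypothetical, and the statistics and CSV steps are inlined so the example runs on its own.

```r
# Toy stand-in for employee_data (hypothetical values)
toy_data <- data.frame(
  company_id = rep(c("Company_A", "Company_B"), each = 3),
  salary     = c(4000, 6000, 8000, 5000, 7000, 9000),
  KPI_score  = c(95, 70, 55, 80, 91, 60),
  stringsAsFactors = FALSE
)

# Hypothetical driver: one pass per company
run_all_reports <- function(df, output_dir = tempdir()) {
  results <- list()
  for (comp_name in unique(df$company_id)) {
    df_company <- df[df$company_id == comp_name, ]

    # In the report, this step would be generate_summary_stats(df_company)
    stats <- list(
      total_employees = nrow(df_company),
      avg_salary      = round(mean(df_company$salary), 2),
      top_performers  = sum(df_company$KPI_score > 90)
    )

    # In the report, this step would be export_csv(df_company, comp_name)
    csv_path <- file.path(output_dir, paste0("Data_", comp_name, ".csv"))
    write.csv(df_company, csv_path, row.names = FALSE)

    results[[comp_name]] <- c(stats, list(csv_path = csv_path))
  }
  results
}

reports <- run_all_reports(toy_data)
names(reports)
```

Collecting the results in a named list keeps each company's statistics and file path together for later rendering.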

8.4 Visualization

This function generates interactive visualizations: a salary bar chart, a KPI distribution pie chart, and a scatter plot showing the relationship between KPI and salary.
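The visualization function itself is not shown. As one illustration, a KPI tier distribution chart could be sketched with plotly as follows. The tier counts are made-up toy values, and tier_counts and donut_fig are hypothetical names.

```r
library(plotly)

# Hypothetical tier counts for one company (toy values)
tier_counts <- data.frame(
  kpi_tier = c("Elite", "High Performer", "Average", "Needs Improvement"),
  n        = c(25, 38, 37, 50)
)

# A pie trace with hole > 0 renders as a donut chart
donut_fig <- plot_ly(tier_counts, labels = ~kpi_tier, values = ~n,
                     type = "pie", hole = 0.4) %>%
  layout(title = "KPI tier distribution (toy data)")
```

On the report's data, the counts could come from table(df_company$kpi_tier) for each company.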

8.5 Report: Company_A

  • Total Employees : 150
  • Avg Salary : $ 8,783.24
  • Avg KPI : 70.66
  • Top Performers : 32

Company_A: Department Summary

| department | Employees | Avg_Salary | Avg_KPI | Top_Performers |
|---|---|---|---|---|
| Engineering | 26 | 9017.61 | 67.11 | 8 |
| Finance | 34 | 8126.53 | 71.64 | 6 |
| HR | 34 | 9020.41 | 70.74 | 6 |
| Marketing | 26 | 8426.28 | 71.80 | 6 |
| Operations | 30 | 9364.97 | 71.55 | 6 |

8.6 Report: Company_B

  • Total Employees : 150
  • Avg Salary : $ 8,418.26
  • Avg KPI : 70.56
  • Top Performers : 26

Company_B: Department Summary

| department | Employees | Avg_Salary | Avg_KPI | Top_Performers |
|---|---|---|---|---|
| Engineering | 28 | 8951.44 | 69.21 | 3 |
| Finance | 29 | 8189.87 | 73.97 | 8 |
| HR | 34 | 8432.99 | 74.42 | 6 |
| Marketing | 31 | 8771.35 | 69.66 | 7 |
| Operations | 28 | 7712.84 | 64.69 | 2 |

8.7 Report: Company_C

  • Total Employees : 150
  • Avg Salary : $ 9,097.39
  • Avg KPI : 70.41
  • Top Performers : 25

Company_C: Department Summary

| department | Employees | Avg_Salary | Avg_KPI | Top_Performers |
|---|---|---|---|---|
| Engineering | 38 | 9003.82 | 74.77 | 9 |
| Finance | 25 | 8291.09 | 70.78 | 4 |
| HR | 34 | 9575.61 | 68.77 | 3 |
| Marketing | 30 | 9314.98 | 68.02 | 6 |
| Operations | 23 | 9137.64 | 68.37 | 3 |

8.8 Report: Company_D

  • Total Employees : 150
  • Avg Salary : $ 9,398.08
  • Avg KPI : 69.13
  • Top Performers : 24

Company_D: Department Summary

| department | Employees | Avg_Salary | Avg_KPI | Top_Performers |
|---|---|---|---|---|
| Engineering | 24 | 9467.59 | 70.79 | 4 |
| Finance | 30 | 10891.77 | 72.67 | 8 |
| HR | 25 | 7707.59 | 69.72 | 4 |
| Marketing | 39 | 8701.53 | 68.10 | 5 |
| Operations | 32 | 10115.24 | 65.34 | 3 |

8.9 Report: Company_E

  • Total Employees : 150
  • Avg Salary : $ 9,324.65
  • Avg KPI : 71.53
  • Top Performers : 26

Company_E: Department Summary

| department | Employees | Avg_Salary | Avg_KPI | Top_Performers |
|---|---|---|---|---|
| Engineering | 28 | 9324.51 | 71.44 | 7 |
| Finance | 27 | 9584.96 | 74.61 | 5 |
| HR | 33 | 8833.76 | 71.15 | 5 |
| Marketing | 27 | 8849.06 | 71.36 | 4 |
| Operations | 35 | 9953.68 | 69.73 | 5 |

Interpretation

  1. Average Salary per Company (Bar Chart): Average salaries across all companies fall within a competitive range of $8,000 to $10,000, with Company D paying the most and Company B the least. This points to potential cost inefficiency if the higher pay is not matched by comparable performance.

  2. KPI Tier Distribution (Donut Chart): Overall workforce quality appears critical, because Needs Improvement is the largest single tier at 32.4%, while the high-performing groups (Elite and High Performer) together reach only 44.1%. Recruitment and employee development strategies should therefore be evaluated.

  3. Average KPI Score per Company (Line Chart): Average performance stays flat at around 70 across all companies, with no significant differences between them. This reinforces the finding that salary levels (particularly at Company D) do not yet correlate positively with higher KPI scores.