Assignment Week 5

Data Science Programming 1

1 Task 1: Dynamic Multi-Formula Function

1.1 Introduction

This document explains the implementation of the function compute_formula(x, formula), which evaluates four mathematical formulas dynamically:

Formula	Equation
Linear	$y = 2x + 1$
Quadratic	$y = x^2 + 3x + 2$
Cubic	$y = x^3 - 2x^2 + x$
Exponential	$y = e^{0.3x}$

The reason for choosing the range $x = 1$ to $20$:

This range was chosen because it is long enough to clearly show the difference in growth between the four formulas, especially the divergence between polynomial (cubic) and exponential growth, which becomes significant above $x = 10$.

1.2 Part 1 - function `compute_formula()`

This function accepts two parameters:

x: Numeric value (can be a vector)
formula: Name of the formula to calculate, given as a string

Inside the function, there are two layers of input validation, followed by conditional logic (if-else) to choose the appropriate formula.

# ============================================================
# MAIN FUNCTION: compute_formula(x, formula)
# Wraps the entire logic for formula selection and calculation
# ============================================================

compute_formula <- function(x, formula) {

  # --- VALIDATION 1: Ensure x is numeric ---
  # If x is a character or factor, stop execution
  if (!is.numeric(x)) {
    stop("Error: Argument 'x' must be a numeric value.")
  }

  # --- VALIDATION 2: Ensure formula name is recognized ---
  # List of available formulas in this function
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")

  # Normalize input: convert to lowercase & remove extra spaces
  formula <- tolower(trimws(formula))

  # Check if the formula is in the valid list
  if (!(formula %in% valid_formulas)) {
    stop(paste("Unknown formula:", formula))
  }

  # --- CONDITIONAL LOGIC: Select and calculate formula ---
  # Using if-else to determine the formula based on input 'formula'

  if (formula == "linear") {
    # Linear formula: y = 2x + 1
    # Constant growth — increases by 2 for every 1 unit increase in x
    result <- 2 * x + 1

  } else if (formula == "quadratic") {
    # Quadratic formula: y = x^2 + 3x + 2
    # Growth accelerates as x increases (parabola)
    result <- x^2 + 3 * x + 2

  } else if (formula == "cubic") {
    # Cubic formula: y = x^3 - 2x^2 + x
    # Very rapid growth — dominant for large x
    result <- x^3 - 2 * x^2 + x

  } else if (formula == "exponential") {
    # Exponential formula: y = e^(0.3x)
    # Compounded growth — the fastest among all formulas
    result <- exp(0.3 * x)
  }

  # Return the calculation result
  return(result)
}

1.3 Part 2 - Nested Loops to Compute All Formulas

Nested loops are used so that all combinations of the $x$ values (20 values) and the formula types (4 formulas) are processed automatically in a single run, without having to rewrite the code for each combination.

# ============================================================
# NESTED LOOPS: Iterate x and formulas
# ============================================================

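# Helper function for calculation
# Note: this redefines compute_formula() from Part 1 with a different cubic
# (extra "- 1" term) and a different exponential (2^x instead of exp(0.3 * x));
# the table below reflects this version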
compute_formula <- function(x, type) {
  type <- tolower(type)
  if (!is.numeric(x)) stop("x must be numeric")
  valid <- c("linear", "quadratic", "cubic", "exponential")
  if (!(type %in% valid)) stop(paste("Unknown formula:", type))
  
  switch(type,
    "linear"      = 2 * x + 1,
    "quadratic"   = x^2 + 3*x + 2,
    "cubic"       = x^3 - 2*x^2 + x - 1,
    "exponential" = 2^x
  )
}

# Define inputs
x_values <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

# Initialize results dataframe
results <- data.frame(
  x       = integer(0),
  formula = character(0),
  y       = numeric(0)
)

# --- Nested Loop Execution ---
for (x_val in x_values) {          # Loop through x values
  for (frm in formulas) {          # Loop through formulas
    y_val <- compute_formula(x_val, frm)
    results <- rbind(
      results,
      data.frame(x = x_val, formula = frm, y = y_val)
    )
  }
}

# ============================================================
# OUTPUT: Wide format table
# ============================================================
library(tidyr)
library(knitr)

# Pivot data to wide format
results_wide <- pivot_wider(
  results,
  names_from  = formula,
  values_from = y
)

# Set clean column names
colnames(results_wide) <- c("x", "Linear", "Quadratic", "Cubic", "Exponential")

# Render table
kable(results_wide, format = "simple", align = "r",
      caption = "Comparison of y Values for Each Formula")

Comparison of y Values for Each Formula
x	Linear	Quadratic	Cubic	Exponential
1	3	6	-1	2
2	5	12	1	4
3	7	20	11	8
4	9	30	35	16
5	11	42	79	32
6	13	56	149	64
7	15	72	251	128
8	17	90	391	256
9	19	110	575	512
10	21	132	809	1024
11	23	156	1099	2048
12	25	182	1451	4096
13	27	210	1871	8192
14	29	240	2365	16384
15	31	272	2939	32768
16	33	306	3599	65536
17	35	342	4351	131072
18	37	380	5201	262144
19	39	420	6155	524288
20	41	462	7219	1048576
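
The nested loop above grows the results data frame one row at a time. As a minimal sketch of an alternative, and assuming the same four formulas as the helper in this section (cubic with the "- 1" term, exponential as 2^x), the wide table can also be built by applying each formula to the whole vector of x values. The names formula_fns and wide are illustrative and not part of the original code.

```r
# Illustrative list of formulas; mirrors the helper used for the table above
formula_fns <- list(
  linear      = function(x) 2 * x + 1,
  quadratic   = function(x) x^2 + 3 * x + 2,
  cubic       = function(x) x^3 - 2 * x^2 + x - 1,
  exponential = function(x) 2^x
)

x_values <- 1:20

# Each formula is vectorized, so one call per formula covers all x values
wide <- cbind(
  data.frame(x = x_values),
  as.data.frame(lapply(formula_fns, function(f) f(x_values)))
)

print(head(wide, 3))
```

Because every formula already works on vectors, no pivoting step is needed in this version.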

1.4 Part 3 - Visualization: Plot All Formulas in One Graph

For this visualization, I use the plotly library so that the resulting graph is interactive. The benefit is that hovering the cursor over any point shows its exact value, which makes it easy to see how drastically the linear and exponential formulas differ at high values of $x$.

Note: The cubic and exponential values grow very quickly once $x > 10$ and become very large for $x > 15$. The Y axis is capped at 8000 so that all four formulas remain readable in one graph.

Interpretation

This visualization illustrates the contrasting growth rates of the four functions, with a clear divergence once $x$ exceeds 10. According to the table above, the Exponential function overtakes the Cubic function at $x = 10$ (1,024 versus 809) and reaches 1,048,576 at $x = 20$, so it leaves the capped axis after $x = 12$. The Cubic function also rises steeply, passing 7,000 at $x = 20$. In contrast, the Linear and Quadratic functions stay close to the bottom of the graph (41 and 462 at $x = 20$). Capping the Y axis at 8,000 keeps the three slower functions readable across the $x = 1$ to $20$ range.

1.5 Part 4 - Test Input Validation

Testing covers four scenarios. The two error cases are wrapped in tryCatch() so that the program does not stop when it encounters an error.

# ============================================================
# VALIDATION TEST: Ensuring the function handles incorrect input
# ============================================================

cat(
  # Test 1: Unknown formula
  tryCatch(compute_formula(5, "logarithmic"), error = function(e) e$message),
  
  # Test 2: x is not numeric
  tryCatch(compute_formula("ten", "linear"), error = function(e) e$message),
  
  # Test 3: Capitalized input (case-insensitive)
  compute_formula(5, "LINEAR"),
  
  # Test 4: x as a vector (vectorized)
  compute_formula(c(1, 5, 10), "quadratic"),
  
  sep = "\n"
)

## Unknown formula: logarithmic
## x must be numeric
## 11
## 6
## 42
## 132
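
The tryCatch() pattern used in these tests can be wrapped in a small helper so that each test reports either the result or the error message. This is only a sketch: safe_run and half_of are hypothetical names introduced for illustration and do not appear in the original code.

```r
# Hypothetical function under test: rejects non-numeric input
half_of <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  x / 2
}

# Hypothetical helper: evaluate an expression, return its value or the error text
safe_run <- function(expr) {
  tryCatch(expr, error = function(e) paste("Error caught:", conditionMessage(e)))
}

ok  <- safe_run(half_of(10))      # normal result
bad <- safe_run(half_of("ten"))   # error converted to a message

print(ok)
print(bad)
```

The expression is only evaluated inside tryCatch(), so an error in it is caught rather than stopping the script.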

2 Task 2: Nested Simulation — Multi-Sales & Discounts

2.1 Initialisation and Function Preparation

The first step is to set the random seed so that the generated random numbers are always the same. Then we define the main function simulate_sales, which contains two nested functions: one to calculate the cumulative sales (calc_cumulative) and one for the discount logic (get_discount).

# Initial setup to ensure consistent simulation results
set.seed(42)

simulate_sales <- function(n_salesperson, days) {

  # NESTED FUNCTION 1: Calculating daily accumulation (Loop)
  calc_cumulative <- function(sales_vec) {
    cum_total <- 0
    result    <- numeric(length(sales_vec))
    for (i in seq_along(sales_vec)) {
      cum_total  <- cum_total + sales_vec[i]
      result[i]  <- cum_total
    }
    return(result)
  }

  # NESTED FUNCTION 2: Discount logic (Conditional Logic)
  get_discount <- function(amount) {
    if      (amount >= 900) return(0.20)  # 20% Discount
    else if (amount >= 600) return(0.15)  # 15% Discount
    else if (amount >= 300) return(0.10)  # 10% Discount
    else                    return(0.05)  # 5% Discount
  }

  all_data <- data.frame()

  # OUTER LOOP: Iteration per salesperson
  for (sp in 1:n_salesperson) {
    sales_vec  <- numeric(days)
    disc_vec   <- numeric(days)

    # INNER LOOP: Daily iteration
    for (d in 1:days) {
      amt          <- round(runif(1, min = 100, max = 1000), 2)
      sales_vec[d] <- amt
      disc_vec[d]  <- get_discount(amt)
    }

    cum_vec <- calc_cumulative(sales_vec)

    # Storing results into a Data Frame
    sp_df <- data.frame(
      sales_id      = paste0("SP", sp),
      day           = 1:days,
      sales_amount  = sales_vec,
      discount_rate = disc_vec,
      cumulative    = cum_vec
    )
    all_data <- rbind(all_data, sp_df)
  }
  return(all_data)
}
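
The two nested helpers have direct vectorized counterparts in base R. The sketch below is an alternative, not the code used in this report: cumsum() reproduces calc_cumulative, and findInterval() applies the discount thresholds of get_discount to a whole vector at once. The name get_discount_vec and the sample values are illustrative.

```r
# Vectorized discount lookup; thresholds copied from get_discount above
get_discount_vec <- function(amount) {
  rates <- c(0.05, 0.10, 0.15, 0.20)
  # findInterval gives 0 below 300, 1 in [300, 600), 2 in [600, 900), 3 from 900 up
  rates[findInterval(amount, c(300, 600, 900)) + 1]
}

sales <- c(950, 600, 299.99, 300)

discounts  <- get_discount_vec(sales)
cumulative <- cumsum(sales)   # same running total as the explicit loop

print(discounts)
print(cumulative)
```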

2.2 Simulation Execution and Data Snapshot

After the function is ready, we run a simulation of 3 salespeople over 30 days. The result of this function is the raw data that we will use for further analysis. Here are the first 6 rows of the generated data:

## First 6 Rows of Simulation Result Data:

2.3 Cumulative Sales Trend Visualization

The graph below shows how each salesperson's total sales grow day by day. This visualization makes it easier to see who has the steepest growth.

Interpretation

This visualization tracks the cumulative sales performance over a 30-day period. All three salespeople show consistent positive growth, with a clear divergence in performance becoming visible after the second week.

SP1 emerges as the top performer, maintaining the steepest growth curve to finish with the highest total. SP2 follows with competitive results, while SP3 shows a steadier, more gradual increase. Overall, this chart successfully validates our simulation’s ability to model distinct individual sales dynamics.

3 Task 3: Multi-Level Function Performance

3.1 Initialisation and Function Preparation

First, we define the function categorize_performance, which accepts a vector of sales values and returns the performance category for each element. Inside the main function there is a nested function, get_category, which handles the category logic based on the sales value.

# Main function: accepts a vector, returns a category vector
categorize_performance <- function(sales_vec) {
  # NESTED FUNCTION: Category logic for a single value (Conditional Logic)
  get_category <- function(val) {
    if      (val >= 900) return("Excellent")
    else if (val >= 700) return("Very Good")
    else if (val >= 500) return("Good")
    else if (val >= 300) return("Average")
    else                 return("Poor")
  }
  
  categories <- character(length(sales_vec))
  
  # LOOP: Iteration through each element of the sales vector
  for (i in seq_along(sales_vec)) {
    categories[i] <- get_category(sales_vec[i])
  }
  
  return(categories)
}
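
As an alternative sketch, the same thresholds can be applied to the whole vector with cut(), which avoids the explicit loop. The function name categorize_with_cut is illustrative; the breaks and labels are copied from get_category above.

```r
categorize_with_cut <- function(sales_vec) {
  as.character(cut(
    sales_vec,
    breaks = c(-Inf, 300, 500, 700, 900, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = FALSE   # intervals are closed on the left, matching the >= tests
  ))
}

print(categorize_with_cut(c(950, 900, 700, 699, 300, 299)))
```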

3.2 Execution and Data Snapshot

We use the sales data (sales_amount) from the Task 2 simulation. The function categorize_performance is applied to that column and the result is saved as a new column, performance. Here are the first 6 rows:

# Data df is already available from Task 2 — simply apply the categorization function
df$performance <- categorize_performance(df$sales_amount)
head(df)

3.3 Summary Statistics by Performance Category

The following table summarizes the number of transactions and the percentage for every performance category, so that the data distribution can be read at a glance.

Performance Category Summary
Category	Count	Percentage
Excellent	14	15.6%
Very Good	20	22.2%
Good	19	21.1%
Average	17	18.9%
Poor	20	22.2%
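
A summary of this shape can be produced with table() and prop.table(). The sketch below uses a short made-up vector of categories rather than the report's data, so its numbers are not the ones in the table above.

```r
# Made-up example categories (not the report's data)
performance <- c("Good", "Poor", "Good", "Excellent",
                 "Average", "Poor", "Good", "Very Good")

counts <- table(performance)

summary_df <- data.frame(
  Category   = names(counts),
  Count      = as.integer(counts),
  Percentage = sprintf("%.1f%%", 100 * as.numeric(prop.table(counts)))
)

print(summary_df)
```

Note that table() orders the categories alphabetically; an ordered factor would be needed to list them from Excellent down to Poor.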

3.4 Visualization: Bar Chart and Pie Chart of Category Distribution

The following two graphs show the distribution of performance categories. The bar chart shows the count and percentage of each category, while the pie chart shows the proportions visually.

Interpretation

This visualization highlights a balanced distribution across five performance categories. The ‘Very Good’ and ‘Poor’ segments lead the chart, each accounting for 20 transactions (22.2%). In contrast, the ‘Excellent’ category shows the lowest frequency with 14 transactions (15.6%). These results indicate a significant opportunity to optimize lower-performing transactions and shift them toward higher performance levels.

4 Task 4: Multi-Company Dataset Simulation

4.1 Function Definition

Here, we build the core “engine” through the generate_company_data function. This function is designed with nested loops: the outer loop handles company identities, while the inner loop generates data for each individual employee. It also includes conditional logic to flag employees with a KPI over 90 as “Top Performers.”

generate_company_data <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  all_data <- list()

  for (comp in 1:n_company) {
    company_id <- paste0("Company_", LETTERS[comp])
    company_records <- list()

    for (emp in 1:n_employees) {
      employee_id       <- paste0(company_id, "_EMP", sprintf("%03d", emp))
      salary            <- round(runif(1, min = 3000, max = 15000), 2)
      department        <- sample(departments, 1)
      performance_score <- round(runif(1, min = 50, max = 100), 1)
      KPI_score         <- round(runif(1, min = 40, max = 100), 1)

      top_performer <- ifelse(KPI_score > 90, "Yes", "No")

      company_records[[emp]] <- data.frame(
        company_id        = company_id,
        employee_id       = employee_id,
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
    }
    all_data[[comp]] <- do.call(rbind, company_records)
  }

  final_df <- do.call(rbind, all_data)
  rownames(final_df) <- NULL
  return(final_df)
}

# Generate 5 companies x 100 employees = 500 rows total
employee_data <- generate_company_data(n_company = 5, n_employees = 100)
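
Building one data frame per employee and binding them is easy to follow but slow for large inputs. A minimal vectorized sketch is shown below, assuming the same columns and value ranges as generate_company_data. The name generate_company_data_vec is hypothetical, and because the random numbers are drawn in a different order, the values will not match the loop version even with the same seed.

```r
generate_company_data_vec <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  n <- n_company * n_employees

  # Company label repeated once per employee; employee number restarts per company
  company_id <- rep(paste0("Company_", LETTERS[seq_len(n_company)]), each = n_employees)
  emp_number <- rep(seq_len(n_employees), times = n_company)
  KPI_score  <- round(runif(n, min = 40, max = 100), 1)

  data.frame(
    company_id        = company_id,
    employee_id       = paste0(company_id, "_EMP", sprintf("%03d", emp_number)),
    salary            = round(runif(n, min = 3000, max = 15000), 2),
    department        = sample(departments, n, replace = TRUE),
    performance_score = round(runif(n, min = 50, max = 100), 1),
    KPI_score         = KPI_score,
    top_performer     = ifelse(KPI_score > 90, "Yes", "No"),
    stringsAsFactors  = FALSE
  )
}

set.seed(42)
demo <- generate_company_data_vec(n_company = 5, n_employees = 100)
print(dim(demo))
```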

4.2 Data Summarization

After creating the primary dataset, we need to look at the big picture. Using the summarise function from dplyr, we calculate the average salary, average performance, maximum KPI, and total number of top performers for each company. These results are stored in the company_summary table, which serves as the basis for the visualizations that follow.

Company Performance Summary
company_id	avg_salary	avg_performance	max_KPI	top_performers
Company_A	8633.96	74.39	99.4	20
Company_B	8344.15	74.04	99.5	23
Company_C	8824.15	74.95	98.2	15
Company_D	9178.27	73.30	100.0	14
Company_E	8935.53	76.11	100.0	20

4.3 Visualizations

4.3.1 Bar Chart: Salary Comparison

This subsection compares the average salary across the companies. Using plot_ly, we create a bar chart equipped with informative tooltips. This feature allows the audience to see details such as company name, average salary, and the count of top performers simply by hovering over the bars.

4.3.2 Boxplot: KPI Distribution

This section presents the KPI score distribution using a boxplot. This visualization is crucial for observing data spread, medians, and outliers. We also added a red threshold line at 90 to provide a clear visual indication of where the “Top Performer” category begins.

Interpretation

Average Salary Comparison

This visualization compares the average financial compensation provided by five different companies. The primary focus is to observe the variance in wage standards across entities and how these figures correlate with the number of high-achieving employees (Top Performers) within each organization.

KPI Distribution & Top Performer Threshold

This chart maps the distribution of employee KPI scores to measure consistency and collective performance quality. The use of a threshold line at 90 serves as a critical indicator to distinguish the general employee population from those with exceptional productivity in each company.

5 Task 5: Monte Carlo Simulation: Pi & Probability

5.1 Task Description and Mathematical Concept

This task uses the Monte Carlo method to estimate the value of $\pi$ and analyze probabilities. The core concept is placing a unit circle inside a $2 \times 2$ square. The ratio of the circle’s area to the square’s area is $\pi/4$. By generating random points, we can calculate: \[\pi \approx 4 \times \frac{\text{points inside circle}}{\text{total points}}\]

5.2 Integrated Function Definition

In this section, we build the monte_carlo_pi function, which integrates the entire simulation workflow. The function operates as follows:

Data Generation: Uses runif to create random $(x, y)$ coordinates.
Geometric Logic: Utilizes the Pythagorean theorem to filter points located inside the circle ($x^2 + y^2 \leq 1$).
Iteration & Analysis: Employs loops to monitor accuracy changes (convergence) and calculate probabilities within specific sub-areas.
Visualization: Generates two plots simultaneously: the spatial distribution of points and the stability trend of the $\pi$ value as the sample size increases.

monte_carlo_pi <- function(n_points) {
  
  # --- STEP 1: Generate Points ---
  set.seed(42)
  x <- runif(n_points, min = -1, max = 1)
  y <- runif(n_points, min = -1, max = 1)
  
  # --- STEP 2: Geometric Logic ---
  distance_sq <- x^2 + y^2
  inside_circle <- distance_sq <= 1
  
  # --- STEP 3: Pi Estimation ---
  pi_estimate <- 4 * sum(inside_circle) / n_points
  
  # --- STEP 4: Convergence Loop ---
  cat("=== π Convergence Over Iterations ===\n")
  checkpoints <- c(100, 500, 1000, 5000, n_points)
  for (n in checkpoints) {
    if (n <= n_points) {
      in_circle_n <- sum(x[1:n]^2 + y[1:n]^2 <= 1)
      pi_n        <- 4 * in_circle_n / n
      error_n     <- abs(pi_n - pi)
      cat(sprintf("  n = %6d | Estimate = %.6f | Error = %.6f\n", n, pi_n, error_n))
    }
  }
  
  # --- STEP 5: Probability Analysis ---
  in_subsquare     <- (x >= 0 & x <= 0.5) & (y >= 0 & y <= 0.5)
  prob_empirical   <- sum(in_subsquare) / n_points
  prob_theoretical <- (0.5 * 0.5) / (2 * 2)
  cat("\n=== Sub-square Probability [0, 0.5] ===\n")
  cat(sprintf("  Theoretical: %.6f | Empirical: %.6f\n", prob_theoretical, prob_empirical))
  
  # --- STEP 6: Build Plotly Objects ---
  plot_n       <- min(n_points, 3000)
  point_colors <- ifelse(inside_circle[1:plot_n], "#2196F3", "#F44336")
  point_labels <- ifelse(inside_circle[1:plot_n], "Inside Circle", "Outside Circle")
  
  theta    <- seq(0, 2 * pi, length.out = 500)
  circle_x <- cos(theta)
  circle_y <- sin(theta)
  
  p1 <- plot_ly() %>%
    add_trace(
      x = x[1:plot_n], y = y[1:plot_n],
      type = "scatter", mode = "markers",
      marker = list(color = point_colors, size = 3, opacity = 0.7),
      text = paste0(point_labels,
                    "<br>x = ", round(x[1:plot_n], 4),
                    "<br>y = ", round(y[1:plot_n], 4)),
      hoverinfo = "text",
      name = "Sample Points"
    ) %>%
    add_trace(
      x = circle_x, y = circle_y,
      type = "scatter", mode = "lines",
      line = list(color = "black", width = 2),
      hoverinfo = "none",
      name = "Unit Circle"
    ) %>%
    add_trace(
      x = c(0, 0.5, 0.5, 0, 0),
      y = c(0, 0, 0.5, 0.5, 0),
      type = "scatter", mode = "lines",
      line = list(color = "#FF9800", width = 2, dash = "dash"),
      hoverinfo = "none",
      name = "Sub-square [0–0.5]"
    ) %>%
    layout(
      title  = list(text = paste0("<b>Spatial Distribution</b> (n = ", n_points, ")"),
                    font = list(size = 14)),
      xaxis  = list(title = "X", range = c(-1.1, 1.1), scaleanchor = "y"),
      yaxis  = list(title = "Y", range = c(-1.1, 1.1)),
      legend = list(orientation = "h", y = -0.15),
      hovermode = "closest"
    )
  
  iter_seq     <- round(seq(100, n_points, length.out = 200))
  pi_estimates <- numeric(length(iter_seq))
  for (i in seq_along(iter_seq)) {
    n_i            <- iter_seq[i]
    in_c_i         <- sum(x[1:n_i]^2 + y[1:n_i]^2 <= 1)
    pi_estimates[i] <- 4 * in_c_i / n_i
  }
  errors <- abs(pi_estimates - pi)
  
  p2 <- plot_ly() %>%
    add_trace(
      x = iter_seq, y = pi_estimates,
      type = "scatter", mode = "lines",
      line = list(color = "#3F51B5", width = 1.5),
      text = paste0("n = ", iter_seq,
                    "<br>π̂ = ", round(pi_estimates, 6),
                    "<br>Error = ", round(errors, 6)),
      hoverinfo = "text",
      name = "Estimated π"
    ) %>%
    add_trace(
      x = c(min(iter_seq), max(iter_seq)),
      y = c(pi, pi),
      type = "scatter", mode = "lines",
      line = list(color = "#E53935", width = 2, dash = "dash"),
      hoverinfo = "none",
      name = paste0("True π = ", round(pi, 6))
    ) %>%
    layout(
      title     = list(text = "<b>Stability Trend of π</b>", font = list(size = 14)),
      xaxis     = list(title = "Sample Size (n)"),
      yaxis     = list(title = "Estimated π", range = c(pi - 0.3, pi + 0.3)),
      legend    = list(orientation = "h", y = -0.15),
      hovermode = "x unified"
    )
  
  invisible(list(
    pi_estimate    = pi_estimate,
    prob_empirical = prob_empirical,
    plot_spatial   = p1,        
    plot_converge  = p2
  ))
}
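
The convergence loops recount the points inside the circle for every checkpoint. As a sketch of an equivalent approach, a running estimate for every sample size can be obtained in one pass with cumsum(). The variable names below are illustrative, and the sample size of 10,000 is taken from the output shown in the next section.

```r
set.seed(42)
n_points <- 10000
x <- runif(n_points, min = -1, max = 1)
y <- runif(n_points, min = -1, max = 1)

inside <- x^2 + y^2 <= 1

# Running estimate: 4 * (points inside so far) / (points drawn so far)
running_pi <- 4 * cumsum(inside) / seq_len(n_points)

# Same quantity the loop computes at a single checkpoint
n <- 1000
loop_value <- 4 * sum(x[1:n]^2 + y[1:n]^2 <= 1) / n

print(c(running = running_pi[n], loop = loop_value))
```

Any checkpoint can then be read off by indexing running_pi, without recomputing the sums.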

5.3 Execution and Result Visualization

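The final step is to execute the simulation function and display both visualization results. Since monte_carlo_pi returns plotly objects, base-graphics settings such as par(mfrow) do not affect them; plotly's subplot() is the usual way to place the two plots side by side.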

## === π Convergence Over Iterations ===
##   n =    100 | Estimate = 2.920000 | Error = 0.221593
##   n =    500 | Estimate = 3.056000 | Error = 0.085593
##   n =   1000 | Estimate = 3.064000 | Error = 0.077593
##   n =   5000 | Estimate = 3.112800 | Error = 0.028793
##   n =  10000 | Estimate = 3.127200 | Error = 0.014393
## 
## === Sub-square Probability [0, 0.5] ===
##   Theoretical: 0.062500 | Empirical: 0.062900

Interpretation

Spatial Distribution Analysis

This plot displays the first 3,000 of the 10,000 random points (the plotting code caps the display at 3,000) to demonstrate the geometric estimation of $\pi$. The proportion of blue points (inside the unit circle) among all points estimates $\pi/4$. The orange dashed sub-square serves as a validation tool: the empirical probability of 0.0629 closely matches the theoretical value of 0.0625, which is consistent with the points being uniformly distributed.

Stability Trend Analysis

This graph visualizes the Law of Large Numbers. At lower sample sizes, the estimated $\pi$ (blue line) shows high volatility due to variance. However, as the sample size increases toward $n = 10,000$, the estimation stabilizes and converges toward the true value of $\pi$ (red dashed line). This trend confirms that the Monte Carlo method becomes significantly more reliable and precise as the number of iterations grows.
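
The size of the error at each checkpoint can be compared with what theory predicts. Each point falls inside the circle with probability $p = \pi/4$, so the estimate $4\hat{p}$ has standard error $4\sqrt{p(1-p)/n}$. The short sketch below evaluates this for the checkpoints used above; pi_se is an illustrative name.

```r
# Standard error of the Monte Carlo estimate of pi for n points
pi_se <- function(n) {
  p <- pi / 4                      # probability a point lands inside the circle
  4 * sqrt(p * (1 - p) / n)
}

checkpoints <- c(100, 500, 1000, 5000, 10000)
se_values   <- pi_se(checkpoints)

print(round(se_values, 4))
```

For $n = 10,000$ this gives a standard error of roughly 0.016, so the observed error of 0.014393 is within one standard error of the true value.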

6 Task 6: Advanced Data Transformation & Feature Engineering

6.1 Library and dataset preparation

First, we need to load our essential libraries: dplyr for data manipulation, ggplot2 for visualization, and tidyr for tidying up data structures. We’ll also generate a synthetic employee dataset to work with.

Initial Dataset Structure
Variable	Type	Sample
employee_id	integer	1, 2, 3
age	integer	58, 22, 46
salary	numeric	70000, 65200, 12700
experience	integer	13, 18, 21
performance	numeric	7.3, 1.4, 9.2
department	character	IT, IT, IT

6.2 Normalisation and Standardisation

In data analysis, it’s often necessary to bring different columns to the same scale. We’ll create two functions: Min-Max Normalization to scale values between $[0, 1]$, and Z-Score Standardization to transform the distribution to have a mean of 0 and a standard deviation of 1.

# FUNCTION 1: Min-Max Normalization [0, 1]
normalize_columns <- function(df) {
  non_num <- df[, !sapply(df, is.numeric), drop = FALSE]
  num_df  <- df[,  sapply(df, is.numeric), drop = FALSE]
  for (col in names(num_df)) {
    x             <- num_df[[col]]
    num_df[[col]] <- (x - min(x)) / (max(x) - min(x))
  }
  cbind(non_num, num_df)
}

# FUNCTION 2: Z-Score Standardization
z_score <- function(df) {
  non_num <- df[, !sapply(df, is.numeric), drop = FALSE]
  num_df  <- df[,  sapply(df, is.numeric), drop = FALSE]
  for (col in names(num_df)) {
    x             <- num_df[[col]]
    num_df[[col]] <- (x - mean(x)) / sd(x)
  }
  cbind(non_num, num_df)
}

# Apply functions
df_norm   <- normalize_columns(df)
df_zscore <- z_score(df)

# Build output tables
norm_summary <- as.data.frame(t(round(as.numeric(summary(df_norm$salary)), 6)))
colnames(norm_summary) <- c("Min", "Q1", "Median", "Mean", "Q3", "Max")

zscore_summary <- as.data.frame(t(round(as.numeric(summary(df_zscore$salary)), 6)))
colnames(zscore_summary) <- c("Min", "Q1", "Median", "Mean", "Q3", "Max")

# OUTPUT
knitr::kable(norm_summary, format = "simple", align = "c",
             caption = "Salary Statistics After Normalization")

Salary Statistics After Normalization
Min	Q1	Median	Mean	Q3	Max
0	0.443611	0.547222	0.548767	0.6625	1

knitr::kable(zscore_summary, format = "simple", align = "c",
             caption = "Salary Statistics After Z-Score")

Salary Statistics After Z-Score
Min	Q1	Median	Mean	Q3	Max
-3.283284	-0.629148	-0.00924	0	0.680469	2.69974
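
Both transformations can be checked by hand on a tiny vector. The sketch below applies the same formulas as normalize_columns and z_score to the made-up values 2, 4, 6.

```r
x <- c(2, 4, 6)

min_max <- (x - min(x)) / (max(x) - min(x))   # rescales to [0, 1]
z       <- (x - mean(x)) / sd(x)              # mean 0, standard deviation 1

print(min_max)
print(z)
```

Note that both formulas divide by zero for a constant column (max equal to min, or a standard deviation of 0), which the functions above do not guard against.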

6.3 Feature Engineering

This section is where we build new columns based on business logic. We will categorize performance levels, create salary brackets, flag senior employees, and bin ages into specific groups.

# FUNCTION 3: create_features(df)
create_features <- function(df) {
  df <- df |>
    mutate(
      # Feature 1: Performance category
      performance_category = case_when(
        performance >= 8  ~ "Excellent",
        performance >= 6  ~ "Good",
        performance >= 4  ~ "Average",
        TRUE               ~ "Poor"
      ),

      # Feature 2: Salary bracket
      salary_bracket = case_when(
        salary >= 80000  ~ "High",
        salary >= 50000  ~ "Mid",
        salary >= 30000  ~ "Low",
        TRUE             ~ "Very Low"
      ),

      # Feature 3: Seniority flag
      is_senior = ifelse(experience >= 10, "Senior", "Junior"),

      # Feature 4: Age group
      age_group = cut(age,
                      breaks = c(20, 30, 40, 50, Inf),
                      labels = c("20s", "30s", "40s", "50s+"),
                      right  = FALSE)
    )
  return(df)
}

# Apply function
df_feat <- create_features(df)

# Build output tables
perf_df <- as.data.frame(table(df_feat$performance_category))
colnames(perf_df) <- c("Category", "Count")

salary_df <- as.data.frame(table(df_feat$salary_bracket))
colnames(salary_df) <- c("Bracket", "Count")

# OUTPUT
knitr::kable(perf_df, format = "simple", align = "c",
             caption = "Performance Category Distribution")

Performance Category Distribution
Category	Count
Average	49
Excellent	49
Good	46
Poor	56

knitr::kable(salary_df, format = "simple", align = "c",
             caption = "Salary Bracket Distribution")

Salary Bracket Distribution
Bracket	Count
High	8
Low	73
Mid	111
Very Low	8
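
One detail of the age_group feature is worth checking: with right = FALSE the bins are closed on the left, and any age below the first break (20) becomes NA. The sketch below shows this on a few made-up ages.

```r
ages <- c(19, 20, 29, 30, 49, 50, 65)

age_group <- cut(ages,
                 breaks = c(20, 30, 40, 50, Inf),
                 labels = c("20s", "30s", "40s", "50s+"),
                 right  = FALSE)

print(data.frame(age = ages, age_group = age_group))
```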

6.4 Distribution and Skewness Analysis

We’ll use a loop to compare statistics before and after standardization, and check if any columns exhibit extreme skewness that might require further transformation.

Distribution Comparison: Before vs After Z-Score
Column	Before_Mean	Before_SD	ZScore_SD	Skewness_Note
SALARY	54189.00	15042.56	1	✓ Normal: -0.11
AGE	42.23	11.41	1	✓ Normal: -0.22
EXPERIENCE	19.89	10.29	1	✓ Normal: -0.2
PERFORMANCE	5.60	2.60	1	✓ Normal: -0.19
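
The code for this comparison is not shown in the report. As a sketch under stated assumptions, the loop below computes the mean, standard deviation, post-standardization standard deviation, and a moment-based skewness for each numeric column. The skewness formula, the helper name skewness_moment, and the made-up data frame are all assumptions; the report may have used a package function and its own cut-off for flagging extreme skewness.

```r
# Moment-based sample skewness (assumed definition): m3 / m2^(3/2)
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up data for illustration only
set.seed(1)
demo <- data.frame(salary = runif(200, 30000, 90000),
                   age    = sample(22:60, 200, replace = TRUE))

rows <- list()
for (col in names(demo)) {
  x <- demo[[col]]
  z <- (x - mean(x)) / sd(x)        # z-score version of the column
  rows[[col]] <- data.frame(
    Column      = toupper(col),
    Before_Mean = round(mean(x), 2),
    Before_SD   = round(sd(x), 2),
    ZScore_SD   = round(sd(z), 2),
    Skewness    = round(skewness_moment(x), 2)
  )
}
comparison <- do.call(rbind, rows)
rownames(comparison) <- NULL
print(comparison)
```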

6.5 Data Visualization

Visualizations help us intuitively understand how the data has changed. We’ll create a histogram to observe the effect of the Z-Score, a boxplot to relate salary to performance, and a comparison for Min-Max normalization.

Interpretation

Salary Distribution: Original vs. Z-Score

The Z-Score transformation successfully mapped salaries into a standardized scale (mean of 0, SD of 1) without altering the distribution’s shape. This is highly effective for normalizing data with wide value ranges, ensuring that machine learning models do not assign disproportionate weight to specific variables based solely on their scale.

Salary per Performance Category

This chart shows a relatively consistent salary variation across all performance categories. However, the presence of an outlier in the “Excellent” group reveals a data anomaly: some high-performing employees are receiving compensation below the average. This indicates that performance is not the sole primary determinant of salary levels in this dataset.

Age Distribution: Original vs. Min-Max

The Min-Max Normalization method successfully mapped age data into a range of $[0, 1]$ while maintaining the integrity of the original frequency distribution. This step is crucial for ensuring that data processing algorithms can identify age patterns more stably and efficiently, without being affected by differences in numerical units.

7 Task 7: Mini Project: Company KPI Dashboard & Simulation

7.1 Environment Setup & Data Generation

The initial stage begins by loading the necessary libraries for data processing and visualization. A simulation function named generate_company_data is defined to create a synthetic dataset. This function uses nested loops to generate a profile for every employee in every company. With the call shown at the end of the code below (n_company = 5, n_employees = 150), it produces 150 employees for each of 5 companies, or 750 rows in total.

library(dplyr)
library(plotly)

# Set seed for reproducibility
set.seed(42)

# Function to generate simulated multi-company data
generate_company_data <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  all_data    <- list()
  
  for (comp in 1:n_company) {
    company_id      <- paste0("Company_", LETTERS[comp])
    company_records <- list()
    
    for (emp in 1:n_employees) {
      employee_id       <- paste0(company_id, "_EMP", sprintf("%03d", emp))
      salary            <- round(runif(1, min = 3000, max = 15000), 2)
      department        <- sample(departments, 1)
      performance_score <- round(runif(1, min = 50, max = 100), 1)
      KPI_score         <- round(runif(1, min = 40, max = 100), 1)
      top_performer     <- ifelse(KPI_score > 90, "Yes", "No")
      
      company_records[[emp]] <- data.frame(
        company_id        = company_id,
        employee_id       = employee_id,
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
    }
    all_data[[comp]] <- do.call(rbind, company_records)
  }
  
  final_df           <- do.call(rbind, all_data)
  rownames(final_df) <- NULL
  return(final_df)
}

# Execute data generation
employee_data <- generate_company_data(n_company = 5, n_employees = 150)

7.2 KPI Tier Categorization

The categorization process is performed using a combination of loops and conditional logic to sort each employee into four distinct KPI tiers: Elite, High Performer, Average, and Needs Improvement. This conditional logic enables precise classification based on the previously generated KPI scores.

# Initialize character vector for tiers
kpi_tier <- character(nrow(employee_data))

# Loop through each row to determine the tier
for (i in 1:nrow(employee_data)) {
  score <- employee_data$KPI_score[i]
  
  if (score > 90) {
    kpi_tier[i] <- "Elite"
  } else if (score >= 75) {
    kpi_tier[i] <- "High Performer"
  } else if (score >= 60) {
    kpi_tier[i] <- "Average"
  } else {
    kpi_tier[i] <- "Needs Improvement"
  }
}

# Assign the new column to the main dataframe
employee_data$kpi_tier <- kpi_tier

7.3 Statistical Summaries

The data is then processed to generate statistical summaries at both company and department levels. Key metrics such as average salary, performance, and the number of top performers are calculated using dplyr aggregation functions to provide a comprehensive overview for management.

# Company-level summary
company_summary <- employee_data %>%
  group_by(company_id) %>%
  summarise(
    avg_salary      = round(mean(salary), 2),
    avg_KPI         = round(mean(KPI_score), 2),
    avg_performance = round(mean(performance_score), 2),
    top_performers  = sum(top_performer == "Yes"),
    elite_count     = sum(kpi_tier == "Elite"),
    total_employees = n(),
    .groups = "drop"
  )

# Department-level analysis
dept_summary <- employee_data %>%
  group_by(department) %>%
  summarise(
    avg_salary      = round(mean(salary), 2),
    avg_KPI         = round(mean(KPI_score), 2),
    total_employees = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_KPI))

7.4 Visualization

Data visualizations are developed using plotly to provide an interactive aspect to the report. This section includes a Grouped Bar Chart for salary and KPI comparison, as well as a Scatter Plot equipped with linear regression lines per company to visually analyze correlations between variables.

Interpretation

Salary & KPI Comparison per Company

Average salaries across all companies remain relatively stable between $8,000 and $9,500, with Company_D recording the highest. However, there is an extreme scale disparity in the visualization, where KPI values (0–100) appear insignificant when compared directly against salary figures that reach thousands of dollars.

KPI Score vs Salary Correlation

The data show no meaningful correlation between performance (KPI) and compensation (salary): the per-company trend lines are nearly flat and the points are scattered without any visible pattern. This is expected, because generate_company_data() draws salary and KPI_score independently of each other, so in this simulated data employee performance is not a factor in salary levels.
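
The visual reading of flat trend lines can be backed with numbers. The sketch below is a minimal example on simulated stand-in data (an assumption: salary and KPI are drawn independently, mirroring the ranges in generate_company_data()). The names sim and fit_stats are illustrative, and the values it prints are not results from this report.

```r
# Simulated stand-in data; NOT the report's employee_data
set.seed(123)
sim <- data.frame(
  company_id = rep(paste0("Company_", LETTERS[1:3]), each = 100),
  salary     = round(runif(300, min = 3000, max = 15000), 2),
  KPI_score  = round(runif(300, min = 40, max = 100), 1),
  stringsAsFactors = FALSE
)

# Per company: Pearson correlation and the slope of lm(salary ~ KPI_score)
fit_stats <- do.call(rbind, lapply(split(sim, sim$company_id), function(d) {
  fit <- lm(salary ~ KPI_score, data = d)
  data.frame(
    company_id  = d$company_id[1],
    correlation = round(cor(d$KPI_score, d$salary), 3),
    slope       = round(unname(coef(fit)["KPI_score"]), 2),
    stringsAsFactors = FALSE
  )
}))
rownames(fit_stats) <- NULL
print(fit_stats)
```

Correlations close to zero would be consistent with the flat regression lines described above.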

8 Task 8: Automated Report Generation

8.1 Library Initialization and Data Simulation

This section loads dplyr for data manipulation, plotly for interactive charts, and htmltools for building HTML output. It also fixes the random seed so the results are reproducible and defines a function that simulates employee data for multiple companies.

library(dplyr)
library(plotly)
library(htmltools)

set.seed(42)

generate_company_data <- function(n_company, n_employees) {
  departments <- c("HR", "Finance", "Engineering", "Marketing", "Operations")
  all_data    <- list()
  
  for (comp in 1:n_company) {
    company_id      <- paste0("Company_", LETTERS[comp])
    company_records <- list()
    
    for (emp in 1:n_employees) {
      employee_id       <- paste0(company_id, "_EMP", sprintf("%03d", emp))
      salary            <- round(runif(1, min = 3000, max = 15000), 2)
      department        <- sample(departments, 1)
      performance_score <- round(runif(1, min = 50, max = 100), 1)
      KPI_score         <- round(runif(1, min = 40, max = 100), 1)
      top_performer     <- ifelse(KPI_score > 90, "Yes", "No")
      
      company_records[[emp]] <- data.frame(
        company_id        = company_id,
        employee_id       = employee_id,
        salary            = salary,
        department        = department,
        performance_score = performance_score,
        KPI_score         = KPI_score,
        top_performer     = top_performer,
        stringsAsFactors  = FALSE
      )
    }
    all_data[[comp]] <- do.call(rbind, company_records)
  }
  
  final_df           <- do.call(rbind, all_data)
  rownames(final_df) <- NULL
  return(final_df)
}

employee_data <- generate_company_data(n_company = 5, n_employees = 150)


# Assign KPI tiers with cut(); intervals are right-closed by default, e.g. (60, 75]
employee_data$kpi_tier <- as.character(cut(employee_data$KPI_score,
                             breaks = c(0, 60, 75, 90, 100),
                             labels = c("Needs Improvement", "Average", "High Performer", "Elite"),
                             include.lowest = TRUE))
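
Because cut() treats each interval as closed on the right by default, it is worth checking how scores that sit exactly on a break are labeled. The sketch below runs the same breaks and labels on a few hand-picked boundary scores; boundary_scores and boundary_tiers are illustrative names.

```r
tier_labels     <- c("Needs Improvement", "Average", "High Performer", "Elite")
boundary_scores <- c(40, 60, 75, 90, 100)

# Same breaks and options as the tier assignment in this section
boundary_tiers <- as.character(cut(boundary_scores,
                                   breaks = c(0, 60, 75, 90, 100),
                                   labels = tier_labels,
                                   include.lowest = TRUE))

print(data.frame(score = boundary_scores, tier = boundary_tiers))
```

Scores of exactly 60 and 75 fall into the lower tier here ("Needs Improvement" and "Average"), whereas the if-else rules in Section 7.2 place them in the upper tier (score >= 60, score >= 75). The two approaches agree for every other score.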

8.2 Summary Statistics and HTML Table Builder

In this section, we create functions to calculate the company’s key statistics and a helper function to convert data frames into clean HTML table formats.

generate_summary_stats <- function(df_company) {
  list(
    total_employees = nrow(df_company),
    avg_salary      = round(mean(df_company$salary), 2),
    max_salary      = round(max(df_company$salary), 2),
    min_salary      = round(min(df_company$salary), 2),
    avg_KPI         = round(mean(df_company$KPI_score), 2),
    avg_performance = round(mean(df_company$performance_score), 2),
    top_performers  = sum(df_company$top_performer == "Yes"),
    elite_count     = sum(df_company$kpi_tier == "Elite"),
    hi_count        = sum(df_company$kpi_tier == "High Performer"),
    avg_count       = sum(df_company$kpi_tier == "Average"),
    low_count       = sum(df_company$kpi_tier == "Needs Improvement")
  )
}
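
The HTML table helper mentioned in this section could look roughly like the sketch below. It is a minimal base R version written for illustration: df_to_html_table is a hypothetical name, the markup is deliberately plain, and cell values are not HTML-escaped.

```r
# Hypothetical helper: turn a data frame into a plain HTML <table> string
df_to_html_table <- function(df) {
  # Header row built from the column names
  header <- paste0("<th>", names(df), "</th>", collapse = "")

  # One <tr> per data row; each cell is converted to character individually
  rows <- vapply(seq_len(nrow(df)), function(i) {
    cells <- vapply(df[i, , drop = FALSE], function(v) as.character(v), character(1))
    paste0("<tr>", paste0("<td>", cells, "</td>", collapse = ""), "</tr>")
  }, character(1))

  paste0("<table><thead><tr>", header, "</tr></thead><tbody>",
         paste(rows, collapse = ""), "</tbody></table>")
}

# Usage on a small illustrative data frame
demo_df   <- data.frame(department = c("HR", "Finance"),
                        Employees  = c(34L, 29L),
                        stringsAsFactors = FALSE)
demo_html <- df_to_html_table(demo_df)
cat(demo_html, "\n")
```

In an R Markdown report, the resulting string could be wrapped with htmltools::HTML() so that it is rendered as markup rather than printed as text.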

8.3 Main Execution Loop

This is the final step: it runs the entire process automatically for every company in the data. The helper below writes one company's records to a CSV file and invisibly returns the file path.

export_csv <- function(df_company, comp_name, output_dir = ".") {
  file_name <- file.path(output_dir, paste0("Data_", comp_name, ".csv"))
  write.csv(df_company, file_name, row.names = FALSE)
  message("CSV exported : ", file_name)
  invisible(file_name)
}
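
A loop that drives the per-company processing might be structured as in the sketch below. It is self-contained for illustration: toy_data is a small stand-in for employee_data, files are written to tempdir(), and report_index is a hypothetical container for the per-company results.

```r
# Toy stand-in for employee_data (assumption: same column names as the report)
set.seed(1)
toy_data <- data.frame(
  company_id = rep(c("Company_A", "Company_B"), each = 5),
  salary     = round(runif(10, min = 3000, max = 15000), 2),
  KPI_score  = round(runif(10, min = 40, max = 100), 1),
  stringsAsFactors = FALSE
)

out_dir      <- tempdir()
report_index <- list()

for (comp_name in unique(toy_data$company_id)) {
  # 1. Subset the records of the current company
  df_company <- toy_data[toy_data$company_id == comp_name, ]

  # 2. Export the subset to CSV (same file naming as export_csv)
  csv_path <- file.path(out_dir, paste0("Data_", comp_name, ".csv"))
  write.csv(df_company, csv_path, row.names = FALSE)

  # 3. Keep a few summary figures for the report
  report_index[[comp_name]] <- list(
    total_employees = nrow(df_company),
    avg_salary      = round(mean(df_company$salary), 2),
    csv_path        = csv_path
  )
  message("Processed ", comp_name)
}
```

Iterating over unique(company_id) means companies added to the data later are picked up without changing the loop.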

8.4 Visualization

This function generates interactive visualizations: a salary bar chart, a KPI distribution pie chart, and a scatter plot showing the relationship between KPI and salary.
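
The three chart types can be sketched with plotly as shown below. This is not the report's actual plotting function: the data frames and their values are illustrative assumptions, and the object names (p_bar, p_pie, p_scatter) are hypothetical.

```r
library(plotly)

# Illustrative inputs; the numbers are made up for the example
salary_summary <- data.frame(
  company_id = c("Company_A", "Company_B", "Company_C"),
  avg_salary = c(8800, 8400, 9100),
  stringsAsFactors = FALSE
)
tier_counts <- data.frame(
  kpi_tier = c("Elite", "High Performer", "Average", "Needs Improvement"),
  n        = c(20, 40, 40, 50),
  stringsAsFactors = FALSE
)
set.seed(7)
points_df <- data.frame(
  KPI_score = round(runif(50, min = 40, max = 100), 1),
  salary    = round(runif(50, min = 3000, max = 15000), 2)
)

# Bar chart: average salary per company
p_bar <- plot_ly(salary_summary, x = ~company_id, y = ~avg_salary, type = "bar")
p_bar <- layout(p_bar, title = "Average Salary per Company",
                xaxis = list(title = "Company"),
                yaxis = list(title = "Average salary"))

# Pie chart of KPI tiers; a non-zero hole renders it as a donut
p_pie <- plot_ly(tier_counts, labels = ~kpi_tier, values = ~n,
                 type = "pie", hole = 0.4)

# Scatter plot: KPI score against salary
p_scatter <- plot_ly(points_df, x = ~KPI_score, y = ~salary,
                     type = "scatter", mode = "markers")
```

Printing any of these objects in an R Markdown chunk renders the interactive widget in the HTML output.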

8.5 Report: Company_A

Total Employees : 150
Avg Salary : $ 8,783.24
Avg KPI : 70.66
Top Performers : 32

Company_A — Department Summary
department	Employees	Avg_Salary	Avg_KPI	Top_Performers
Engineering	26	9017.61	67.11	8
Finance	34	8126.53	71.64	6
HR	34	9020.41	70.74	6
Marketing	26	8426.28	71.80	6
Operations	30	9364.97	71.55	6

8.6 Report: Company_B

Total Employees : 150
Avg Salary : $ 8,418.26
Avg KPI : 70.56
Top Performers : 26

Company_B — Department Summary
department	Employees	Avg_Salary	Avg_KPI	Top_Performers
Engineering	28	8951.44	69.21	3
Finance	29	8189.87	73.97	8
HR	34	8432.99	74.42	6
Marketing	31	8771.35	69.66	7
Operations	28	7712.84	64.69	2

8.7 Report: Company_C

Total Employees : 150
Avg Salary : $ 9,097.39
Avg KPI : 70.41
Top Performers : 25

Company_C — Department Summary
department	Employees	Avg_Salary	Avg_KPI	Top_Performers
Engineering	38	9003.82	74.77	9
Finance	25	8291.09	70.78	4
HR	34	9575.61	68.77	3
Marketing	30	9314.98	68.02	6
Operations	23	9137.64	68.37	3

8.8 Report: Company_D

Total Employees : 150
Avg Salary : $ 9,398.08
Avg KPI : 69.13
Top Performers : 24

Company_D — Department Summary
department	Employees	Avg_Salary	Avg_KPI	Top_Performers
Engineering	24	9467.59	70.79	4
Finance	30	10891.77	72.67	8
HR	25	7707.59	69.72	4
Marketing	39	8701.53	68.10	5
Operations	32	10115.24	65.34	3

8.9 Report: Company_E

Total Employees : 150
Avg Salary : $ 9,324.65
Avg KPI : 71.53
Top Performers : 26

Company_E — Department Summary
department	Employees	Avg_Salary	Avg_KPI	Top_Performers
Engineering	28	9324.51	71.44	7
Finance	27	9584.96	74.61	5
HR	33	8833.76	71.15	5
Marketing	27	8849.06	71.36	4
Operations	35	9953.68	69.73	5

Interpretation

Average Salary per Company (Bar Chart)

Average salaries across the companies fall within a competitive range of $8,000 to $10,000, with Company D paying the most and Company B the least. This points to potential cost inefficiency if higher pay is not matched by comparable performance.

KPI Tier Distribution (Donut Chart)

In aggregate, workforce quality is in a critical state: Needs Improvement is the largest single tier at 32.4%, while the high-performing groups (Elite and High Performer) together reach only 44.1%. Recruitment and employee development strategies therefore need to be evaluated.

Average KPI Score per Company (Line Chart)

Average performance stays flat at around 70 across all companies, with no significant differences between them. This reinforces the finding that salary levels (particularly at Company D) are not yet positively correlated with higher KPI scores.
